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Abstract 


The  emerging  popularity  of  multimedia  data,  as  digital  representation  of  text, 
image,  video  and  countless  other  milieus,  with  prodigious  volumes  and  wild  diver¬ 
sity,  exhibits  the  phenomenal  impact  of  modem  technologies  in  reforming  the  way 
information  is  accessed,  disseminated,  digested  and  retained.  This  has  iteratively 
ignited  the  data-driven  perspective  of  research  and  development,  to  characterize  per¬ 
spicuous  patterns,  crystallize  informative  insights,  and  realize  elevated  experience 
for  end-users,  where  innovations  in  a  spectrum  of  areas  of  computer  science,  in¬ 
cluding  databases,  distributed  systems,  machine  learning,  vision,  speech  and  natural 
languages,  has  been  incessantly  absorbed  and  integrated  to  elicit  the  extent  and  effi¬ 
cacy  of  contemporary  and  future  multimedia  applications  and  solutions. 

Under  the  theme  of  pattern  mining  and  similarity  querying,  this  manuscript 
presents  a  number  of  pieces  of  research  concerning  multimedia  data,  to  address 
an  array  of  practical  tasks  encompassing  automatic  annotation,  outlier  detection, 
community  discovery,  multi-modal  retrieval  and  learning  to  rank,  in  their  respective 
contexts  including  satellite  image  analysis,  internet  traffic  surveillance,  image  bioin¬ 
formatics,  and  Web  search.  A  repertoire  of  extant  and  novel  techniques  pertaining  to 
graph  mining,  clustering  analysis,  tensor  decomposition  and  probabilistic  graphical 
models  has  been  developed  or  adapted,  which  satisfactorily  met  differing  quality  and 
efficiency  requisites  postulated  by  specific  application  settings,  best  exemplified  by 
the  40  times  speed-up  in  annotating  satellite  images  and  the  up  to  30%  performance 
improvement  in  predicting  web  search  user  clicks,  yet  without  the  loss  of  generality 
to  similar  and  related  scenarios. 
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Chapter  1 
Introduction 


1.1  Motivation 

The  vast  and  sprawling  collection  of  digital  multimedia  data,  on  the  crescendo  of  the  Internet  era, 
has  manifested  unprecedented  bandwidth  and  efficacy  of  information  production  and  transmis¬ 
sion,  engendering  a  profound  enrichment  of  our  everyday  experience.  Its  proliferating  ubiquity 
could  be  aptly  illustrated  by  the  overwhelming  popularity  of  smart  phones,  with  increasingly 
powerful  capacity  and  elegant  design  for  Web  browsing,  image/video  recording  and  playback, 
location-based  services,  to  name  a  few,  as  well  as  an  almost  omnipresent  social  layer  upon  them, 
thereby  generating  and  exchanging  data  flows  in  a  miscellany  of  modes  that  extend  far  beyond 
those  from  call  making  and  text  messages  only  a  couple  of  years  ago. 

Web  search,  from  which  the  Internet  giant  Google  sprouted  and  thrived  in  the  previous 
decade,  serves  as  another  vivid  example,  with  most  lustrous  breakthroughs  in  this  decade  to 
be  much  likely  towards  mobile  search,  which  renders  mobile  content  more  “accessible  and  use¬ 
ful”,  plus  social  search,  which  “brings  friend  effect  to  search”,  and  also  universal  search,  which 
blends  regular  Web  results  with  a  variety  of  “verticals”  such  as  image,  videos,  news  articles, 
microblogging  feeds  and  quite  a  few  others. 

These  innovations,  with  gigantic  and  usually  unwieldy  multimedia  data  flows,  entail  a  de¬ 
mand  of,  and  would  unexcludingly  benefit  from,  novel  computational  approaches  to  navigate 
and  explore  the  space  of  information  mapped  from  multi-aspect,  and  typically  structured  records. 
To  satisfy  the  hunger  of  such  algorithmic  and  heuristic  tools,  this  manuscript  presents  a  set  of 
studies  in  an  attempt  to  yield  knowledge,  information  and  insight  into  multimedia  data  from  two 
perspectives  -  mining  and  querying. 


1.2  Mining  Multimedia  Data 

Instead  of  making  a  pronouncement  on  the  definition  of  multimedia  mining,  it  might  be  more 
informative  to  commence  with  a  sketch  through  concrete  examples  to  be  covered  at  length  in 
relevant  chapters  of  the  thesis: 
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Satellite  Imagery  -  Given  a  collection  of  high-resolution  satellite  map  images  spanning 
several  gigabytes,  each  of  which  is  divided  into  small  rectangular  or  hexagonal  tiles,  with 
a  few  of  them  labeled  a  priori  by  domain  experts  using  a  controlled  vocabulary,  how  to,  in 
an  efficient  way,  suggest  labels  for  all  remaining  tiles,  propose  refinements  of  the  labeling 
vocabulary  to  better  distinguish  these  tiles,  and  spot  “outlier  tiles”  that  are  dissimilar  with 
most  others? 

Network  Traffic  Log  -  As  part  of  a  network  measurement  study,  packet  traces  sent  from 
or  received  by  a  several-thousand-host  enterprise  network  during  an  over- 100-hour  time 
span  were  collected  to  generate  millions  of  records  which  log,  for  each  packet,  its  source, 
destination,  communication  port,  and  time  stamp.  The  task  of  interest  is  to  automatically 
and  swiftly  detect  possibly  suspicious  activities  to  be  reported  for  further  investigation. 

Web  Knowledge  Base  -  This  is  produced  by  an  automated  system  which  constantly  “reads 
the  web”  to  extract  facts  such  as  (Facebook,  Headquarteredln,  Palo_Alto).  Is  it  possible  to 
distill  some  knowledge  out  of  these  simple  and  flat  facts?  Is  it  possible  to  further  leverage 
these  facts  to  compose  intelligent  answers  to  questions  like  “tell  me  something  interesting 
about  George  Harrison”? 

Biological  Image  Collection  -  To  study  the  early  development  process  of  fruit  fly,  a  to¬ 
tal  of  more  than  70,000  embryo  images  were  captured  documenting  the  spatio-temporal 
expression  activities  of  selected  genes  during  the  first  24  hours  after  fertilization.  An  am¬ 
bitious  goal  is  to  identify  groups  of  genes  that  exhibit  correlated  patterns  over  time  and 
visualize  such  relationship  in  silico  to  assist  further  scientific  verification  and  discovery. 

Each  example  presented  precedingly  illuminates  a  scenario  relevant  to  pattern  mining  for  mul¬ 
timedia  data.  As  diverse  as  these  subjects  may  be,  they  do  share  a  single  feature:  data-driven 
problem  solving  over  multiple  modes  at  a  non-trivial  scale.  And  this  becomes  the  working  defi¬ 
nition  of  multimedia  mining  in  this  monograph,  based  on  which  we  will  strive  to  understand  the 
data  available  and  the  goals  set  in  each  context.  How  was  the  data  set  collected,  or,  put  another 
way,  how  was  the  measurement  taken  for  each  aspect  of  the  data?  What  are  the  connections 
amongst  different  data  modes,  as  well  as  attributes  within  the  same  mode?  How  to  transform  one 
or  more  modes  of  the  data  into  a  compact  and  viable  representation,  if  at  all  necessary?  How  to 
craft  algorithms  and  heuristics  accordingly  that  attain  appropriate  trade-off  between  quality  and 
computational  complexity?  How  to  design  numerical  experiments  to  evaluate  proposed  methods 
with  confidence?  The  second  part  of  the  dissertation  will  be  devoted  to  address  the  forerunning 
list  of  questions. 


1.3  Querying  Multimedia  Data 

A  substantial  part  of  the  information  seeking  behavior,  especially  from  Internet  users,  is  pertinent 
to  similarity  querying,  which  facilitates  the  exploration  to  broaden  their  scope  of  knowledge,  and 
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Figure  1.1:  Examples  of  query  interfaces  which  (a)  either  explicitly  solicits  input,  (b)  or  adopts 
a  light-weight  manner  to  infer  users’  preferences  based  on  their  profiles  and  impresses  results 
devoid  of  typing. 


Figure  1.2:  A  graphical  representation  for  user-page  recommendation. 


possibly  arouses  secondary  desire  and  curiosity  for  more  discoveries.  The  latter  has  been  critical 
to  boost  up  engagement  metrics  for  web  sites  like  Facebook  and  Amazon  by  keeping  users  onsite, 
both  of  which  have  a  rich  set  of  multimedia  contents  to  offer. 

A  querying  system  for  such  multi-aspect  data  provides  an  interface  to  retrieve  records  within 
and  across  modes  that  best  match  users’  information  need.  Figure  1.1(a)  depicts  an  interface 
explicitly  requesting  user  inputs  to  query  the  Web,  whereas  Figure  1.1(b)  sits  at  the  other  end  of 
the  gamut  which  infers  users’  interests  based  on  their  profiles  and  past  activities,  not  uncommon 
in  nowadays  online  recommendation  applications. 

A  key  algorithmic  part  of  the  querying  system  is  the  derivation  of  a  quantitative  similarity 
measure.  For  the  page  recommendation  example,  a  graphical  representation  is  illustrated  in 
Figure  1.2,  with  two  layers  of  nodes,  users  and  pages,  as  well  as  inter-layer  links  indicating  the 
user  becomes  a  fan  of  the  page.  The  intuition  about  closeness  is  that  when  two  pages  have  a 
larger  fraction  of  overlapping  fan  bases,  e.g.,  pages  “FCB”  and  “ACM”  compared  with  any  other 
combinations,  they  are  similar  to  each  other  and  thereby  closer  to  each  other’s  existing  fans; 
iteratively,  if  a  pair  of  users  have  quite  a  few  pages  of  common  interest,  one’s  favorite  pages  could 
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be  recommended  to  the  other  if  she  is  not  yet  a  fan.  This  notion  could  be  reflected  by  a  measure  of 
proximity  over  graphs  by  performing  random  walk  with  restarts  [57]  and  computing  the  steady- 
state  probability,  which  could  be  obtained  efficiently  even  for  million-node  graphs  [108]. 

On  an  additional  note,  the  graph  transformation  itself  could  be  non-trivial.  With  regards  to  the 
previous  user-page  graph,  intra-layer  edges  could  be  introduced  between  users  who  are  friends 
in  a  social  network,  and  edge  weights  may  be  further  assigned  according  to,  say,  the  number  of 
mutual  friends  they  share  and  the  frequency  they  are  involved  in  the  same  thread  of  posts  and 
comments.  Pages  may  bear  a  categorical  attribute  such  as  people,  sports,  movies,  and  alike,  this 
could  be  made  an  additional  layer  of  nodes  or  be  incorporated  in  the  decision  to  link  pairs  of 
nodes  within  the  page  layer. 

Armed  with  the  foregoing  idea  to  formulate  the  querying  problem  into  graph  search,  we 
present  the  implementation  of  an  online  interface  to  offer  cross-modal  search  capacity  for  an 
annotated  biological  image  database  in  the  opening  chapter  of  the  third  part  of  the  dissertation. 
The  succeeding  chapter,  in  a  different  vein,  studies  multimedia  corpora  where  each  querying  unit 
itself  is  represented  as  features  from  more  than  one  modes  bearing  quite  different  semantics.  A 
family  of  probabilistic  graphical  model  is  introduced  as  the  solution,  which  provides  both  flex¬ 
ibility  to  accommodate  heterogeneous  numerical  features  and  interpretability  by  summarizing 
concepts  or  themes  from  the  data  and  relating  them  to  perceivable  entities  (e.g.,  words  and  im¬ 
ageries),  a  highly  desirable  property  for  practitioners.  The  last  topic  of  this  part  is  belated  to  user 
interaction  with  a  querying  system,  given  the  name  “user  implicit  feedback”.  Statistical  models 
abstracting  users’  behavior  as  they  examine  the  list  of  querying  results  and  issue  clicks  along  the 
way  are  proposed,  with  the  ambition  to  improve  future  ranking  for  elevated  user  experience. 


1.4  Thesis  Organization 

The  body  chapters  of  the  thesis  are  organized  into  two  parts.  The  part  that  comes  first  consists  of 
the  following  two  chapters  addressing  mining  problems: 

•  QMAS:  low-labor  labeling,  representative  finding,  and  singling  out  anomalies  for  satellite 
images  and  other  image  collections.  The  proposed  algorithm  yields  a  40x  speed  up  over 
the  baseline  without  loss  of  labeling  quality.  This  study  is  presented  in  Chapter  2  and  was 
published  as  [30]. 

•  MultiAspectForensics :  mining  heterogeneous  network  data  using  tensor  decomposition 
with  applications  in  cyber-security  surveillance  and  pattern  discovery  in  structured  knowl¬ 
edge  base.  Two  novel  types  of  subgraph  patterns  were  discovered  from  data  sets  across 
domains.  This  study  is  presented  in  Chapter  3  and  was  published  as  [75]. 

The  theme  of  the  later  part  is  querying  and  it  spans  over  following  three  chapters: 

•  CDEM :  an  online  query  interface  for  Drosophila  embryo  image  databases.  It  supports 
cross-modal  queries  over  a  large  database  which  consists  of  genes,  embryo  images  docu¬ 
menting  gene  expression,  and  image  annotations.  This  study  is  presented  in  Chapter  4  and 
was  published  as  [48]. 
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•  BEFH :  a  Bayesian  approach  to  inference  and  learning  with  a  family  of  undirected  graphical 
model  for  classification  and  retrieval  in  multimedia  corpora.  This  study  is  presented  in 
Chapter  5  and  was  published  as  [47]. 

•  Click  Models :  learning  to  rank  from  Web  search  click  data  by  defining  and  estimating  user- 
perceived  relevance  of  search  results.  The  highlighted  model  provides  an  easy  and  efficient 
solution  to  account  for  the  position-bias  inherent  in  the  data,  and  achieves  7%  improvement 
over  the  baseline  as  measured  by  a  popular  log-likelihood  metric,  and  30%  performance 
boost  when  predicting  last  click  positions.  This  study  is  presented  in  Chapter  6  and  was 
published  as  [51]. 

Figure  1.3  provides  a  summary  of  data  sets  as  well  as  how  they  are  referred  by  aforemen¬ 
tioned  studies. 


Chap. 

Name 

Type 

Size 

Description 

2 

GeoEye 

Satellite 

Imagery 

17MB 

14  high  quality  images  of  cities  around  the 
world,  divided  into  14,336  rectangular  tiles. 

2 

SAT1.5GB 

Satellite 

Imagery 

1.5GB 

3  huge  satellite  images  in  GeoTIFF  format, 
divided  into  721,408  rectangular  tiles. 

2 

SATLARGE 

Satellite 

Imagery 

1.8GB 

A  4-band  multi-spectral  proprietary  image, 
divided  into  2.57M  hexagonal  tiles. 

3 

LESNL 

Network 
Traffic  Log 

281K 

non-zero 

elements 

4-mode  data  representing  packet  traces 
recorded  on  servers  within  the  Lawrence 
Berkeley  National  Lab. 

3 

RTW 

Web 

Knowledge 

Base 

10K 

non-zero 

elements 

3-mode  triplet  data  from  the  NELL  system  at 
Carnegie  Mellon  University  such  as 
(Pittsburgh,  city-located-in-state,  pa). 

3,4 

BDGP 

Bioimage 

Data 

38K 

non-zero 

elments 

3-mode  data  of  images  documenting  the 
expression  patterns  of  genes  in  the  early 
development  of  Drosophila. 

5 

TRECVID 

Multimedia 

Corpora 

1,078 

video 

clips 

Obtained  from  compiled  TRECVID'03  news 
video  collection  with  a  total  of  1,894  word 
features  and  166  image  features  extracted. 

6 

Click  Log 

Web  Search 

Click  Data 

8.8M 

query 

sessions 

Impressions  and  clicks  for  top-10  positions 
sampled  from  a  major  commercial  search 
engine  in  July  2008. 

Figure  1.3:  A  summary  of  data  sets  included  in  this  manuscript. 
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Part  II 


Mining  Multimedia  Data 


Chapter  2 

QMAS :  Querying,  Mining  And 
Summarization  of  Multi-modal  Databases 


Given  a  large  collection  of  images,  very  few  of  which  have  labels,  how  can  we  guess  the  labels 
of  the  remaining  majority,  and  how  can  we  spot  those  images  that  need  brand  new  labels,  distinct 
from  the  existing  ones?  Current  automatic  labeling  techniques  usually  scale  super  linearly  with 
the  data  size,  and/or  they  fail  when  the  amount  of  labeled  data  is  very  limited.  In  this  chapter, 
we  introduce  QMAS  (Querying,  Mining  And  Summarization  of  multi-modal  databases),  a  fast 
solution  to  the  following  problems: 

•  low-labor  labeling  -  given  a  collection  of  images,  very  few  of  which  are  labeled  with 
keywords,  find  the  most  suitable  labels  for  the  remaining  ones; 

•  mining  and  attention  routing  -  in  a  similar  setting,  find  clusters  and  representative  images 
for  each  cluster,  as  well  as  a  set  of  outlier  images. 

This  chapter  is  based  upon  the  work  published  in  [30]  and  [31].  The  rest  of  this  chapter  is  orga¬ 
nized  as  follows:  we  start  with  an  overview  of  the  proposed  approach  in  Section  2.1,  followed  by 
discussion  of  related  work  in  Section  2.2.  Algorithmic  details  are  presented  in  Section  2.3,  and 
empirical  results  are  covered  by  Section  2.4.  Section  2.5  concludes  the  chapter. 


2.1  Overview 

The  problem  of  automatically  analysis,  labeling  and  understanding  large  collections  of  images 
appears  in  numerous  fields.  Our  driving  application  is  related  to  satellite  imagery,  involving  a 
scenario  in  which  a  topographer  wants  to  analyze  the  terrains  in  a  collection  of  satellite  images. 
We  assume  that  each  image  is  divided  into  smaller  tiles  (say,  16x16  pixels).  Such  a  user  would 
like  to  make  the  effort  to  create  labels  (e.g.,  Water,  Concrete,  Trees,  etc )  for  only  a  small  number 
of  tiles,  and  then  expect  an  automatic  algorithm  to  generate  labels  for  all  the  rest.  The  user  would 
also  like  to  know  what  pieces  of  land  exist  in  the  analyzed  regions  look  “strange”,  not  matching 
any  of  the  known  labels,  since  they  may  indicate  anomalies  (e.g.,  de-forested  areas,  potential 
environmental  hazards,  etc.),  or  errors  in  the  data  collection  process.  Finally,  the  user  would  like 
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Labels:  ■  City  □  Urban  Trees  □  Water  ■  Forest 
□  Not  labeled  /  outlier 

Figure  2.1:  An  illustration  of  labeling  results  from  the  proposed  algorithm.  Left:  the  input 
satellite  image  of  the  city  of  Annapolis,  divided  in  1,  024  (32x32)  tiles,  only  four  of  which  are 
labeled.  Right:  suggested  labels  from  QMAS',  yellow  indicates  outliers  which  are  likely  to  repre¬ 
sent  “Bridge”. 


to  have  a  few  tiles  that  best  represent  each  kind  of  terrain. 

Such  requirements  are  not  only  limited  to  satellite  image  analysis  but  also  arise  in  several 
other  applications  including  medical  image  and  biological  image  pattern  analysis.  For  instance, 
a  doctor  wants  to  find  tomographies  similar  to  the  images  of  his/her  patients  as  well  as  a  few 
examples  that  best  represent  both  the  most  typical  and  the  most  strange  image  patterns  [32,  66], 
or  a  biological  expert  with  a  set  of  imaging  data  such  as  the  expression  pattern  in  the  early 
development  of  fruit  fly  embryos  [106]  may  need  a  system  to  answer  a  similar  set  of  questions. 

Our  goals  are  summarized  in  two  research  problems: 

•  low-labor  labeling  -  given  a  collection  of  images,  up  to  a  few  of  which  are  labeled  a  priori, 
find  the  most  appropriate  labels  for  remaining  ones. 

•  mining  and  attention  routing  -  in  a  similar  setting,  find  clusters,  a  set  of  images  that  best 
represent  the  data  patterns  and  another  set  which  consists  of  top  outliers  which  stand  out 
from  existing  patterns  with  labels. 

Figure  2.1  illustrates  the  input  and  output  of  the  proposed  algorithm  for  low-labor  labeling. 
The  satellite  image,  public  available  at  www .  geoeye  .  com,  displays  part  of  the  city  of  An¬ 
napolis,  MD,  USA.  We  split  it  into  1,  024  (32x32)  tiles,  very  few  (four)  of  which  were  manually 
labeled  as  “City”  (red),  “Water”  (cyan),  “Urban  Trees”  (green)  or  “Forest”  (black),  as  shown  on 
the  left  figure.  QMAS  automatically  produced  the  label  and  results  are  shown  on  the  right.  A 
vast  majority  of  tiles  are  correctly  labeled,  and  a  few  outlier  tiles,  highlighted  in  yellow,  are  also 
picked  up,  with  the  implication  that  they  possibly  deserve  new  label(s)  of  their  own.  A  closer  in- 
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spection  in  this  example  concluded  that  the  outlier  tiles  tend  to  lie  on  the  border  between  “Water” 
and  “City”,  and  are  likely  to  contain  a  bridge. 

For  the  problem  of  mining  and  attention  routing ,  we  take  the  approach  of  finding  clusters  in 
the  data  without  the  information  from  the  user-provided  labels  at  the  beginning.  The  pure  image- 
based  results  may  be  aligned  and  compared  with  the  evidence  from  existing  labels,  possibly 
leading  to  suggestions  for  refinement  such  as  merging  too  specific  labels  that  are  difficult  to  be 
differentiated  in  practice  (e.g.,  “Forest”  and  “Urban  Trees”),  and/or  splitting  too  general  labels 
of  which  tiles  are  not  quite  alike  (e.g.,  “Shallow  Water”  and  “Deep  Sea”  could  replace  a  single 
label  of  “Water”).  Another  advantage  it  offers  is  that  it  enables  group  labeling  -  the  labeling  unit 
could  be  a  cluster  instead  of  a  single  tile. 

Table  2.1  provides  a  list  of  most  common  acronyms  appearing  in  this  chapter. 


Table  2.1:  Summary  of  Acronyms 


Acronym 

Explanation 

ANN 

GBT 

GCap 

MrCC 

RWR 

ViVo 

Approximate  Nearest-Neighbor,  an  algorithm  for  fast  nearest-neighbor  searching 
Generalized  Balanced  Ternary,  a  hexagonal  mathematical  system  for  feature  extraction 
Graph-based  automatic  image  captioning,  the  baseline  image  annotation  algorithm 
Multi-resolution  Correlation  Cluster  detection,  a  scalable  subspace-clustering  method[32] 
Random  Walk  with  Restart  for  establishing  proximity  between  pairs  of  nodes  in  graph 
Visual  Vocabulary,  an  algorithm  to  group  image  tiles  into  a  set  of  visual  terms 

2.2  Related  Work 

2.2.1  Image  Labeling 

There  is  an  extensive  body  of  work  on  the  classification  of  unlabeled  regions  from  partially 
labeled  images  in  the  field  of  computer  vision,  such  as  image  segmentation  and  region  classifica¬ 
tion  [46,  69,  98,  1 10].  The  conditional  random  fields  (CRF)  and  boosting  approach  in  [98]  shows 
the  competitive  accuracy  for  multi-class  classification  and  segmentation,  but  it  is  relatively  slow 
and  requires  a  lot  of  training  examples  to  get  started.  The  random  walk  segmentation  method 
in  [46]  is  closely  related  to  our  work,  but  scalability  is  beyond  the  scope  of  that  work  since  it 
is  concerned  with  the  segmentation  of  a  single  image.  The  KNN  classifier  in  [110]  may  be  the 
fastest  way  for  region  labeling,  but  it  may  be  sensitive  for  outliers.  The  empirical  Bayes  approach 
in  [69]  is  able  to  leam  contextual  information  from  unlabeled  data.  However,  it  may  be  difficult 
to  apply  to  satellite  image  sets. 

Graph-based  methods  provide  a  flexible  tool  for  automatic  image  captioning.  Images  and 
caption  keywords  are  represented  by  multiple  layers  of  nodes  in  a  graph.  Image  content  simi¬ 
larities  are  captured  by  edges  between  image  nodes,  and  existing  image  captions  become  links 
between  corresponding  images  and  keywords.  Such  techniques  have  been  previously  used  in 
GCap  [84],  in  which  a  tri-partite  graph  was  built  based  on  captioned  images,  further  segmented 
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into  regions.  Given  an  image  node  of  interest,  the  Random  Walk  with  Restart  (RWR)  algorithm, 
which  resembles  semi-supervised  label  propagation  on  graphs  [120],  was  used  to  perform  prox¬ 
imity  query  to  automatically  find  the  best  annotation  keyword  for  each  region.  RWR  is  usually 
computed  using  the  power  iteration  method,  which  converges  in  a  few  iterations  in  most  cases. 
Another  alternative  algorithm  for  labeling  is  the  transductive  support  vector  machine,  which  has 
been  shown  to  be  efficient  and  accurate  for  data  which  comes  with  very  high  dimension  and 
sparse  features  like  word  counts  [59].  In  the  satellite  imagery  we  studied  in  this  chapter,  the 
number  of  dimensions  does  not  go  beyond  a  few  dozen  and  certain  features  may  be  irrelevant  to 
the  labeling  class. 

To  create  edges  between  similar  image  nodes,  most  previous  work  searches  for  nearest  neigh¬ 
bors  in  the  image  feature  space.  However,  this  operation  is  super-linear  even  with  the  speed  up 
offered  by  many  approximate  nearest-neighbor  finding  algorithms  (e.g.,  the  ANN  Library  [80]). 
Given  millions  of  image  tiles  in  satellite  image  analysis,  greater  scalability  is  almost  mandatory. 

2.2.2  Clustering 

Most  clustering  algorithms  assume  the  following  cluster  definition:  a  cluster  is  a  region  in  the 
feature  space  in  which  the  objects  are  dense.  This  region  may  have  an  arbitrary  shape,  and  the 
points  inside  it  may  be  arbitrarily  distributed.  A- means  like  methods  start  by  picking  k  points 
in  the  metric  space  as  cluster  centers,  or  centroids,  through  a  random  process  or  by  applying 
some  specific  heuristics  for  this  task.  The  clustering  is  made  possible  by  an  iterative  process  that 
assigns  objects  to  their  closest  centroids,  and  iteratively  improving  the  centroids  according  to  the 
objects  assigned  to  each  cluster.  The  computation  stops  when  a  quality  criterion  is  satisfied  or 
when  a  maximum  number  of  iterations  is  achieved.  An  example  of  this  approach  is  K-Harmonic 
Means  [117].  The  main  drawbacks  of  this  approach  are  that  it  assumes  that  the  clusters  have 
hyper-spherical  shapes  in  the  data  space  and  that  the  number  k  of  clusters  should  be  determined 
by  the  user  a  priori. 

The  Visual  Vocabulary  (ViVo)  [14]  method  is  particularly  useful  for  our  work.  ViVo  is  a  novel 
approach,  proposed  for  the  analysis  of  biomedical  images,  that  applies  Independent  Component 
Analysis  (ICA)  to  group  image  tiles  into  a  set  of  visual  terms,  avoiding  subtle  problems,  such  as 
non-Gaussianity. 

2.2.3  Feature  Extraction 

Feature  extraction  is  generally  considered  to  be  a  low-level  image  processing  task  and  is  closely 
related  to  feature  detection.  Histogram-based  features  are  perhaps  the  simplest  and  most  popular 
type  of  features.  Texture-based  features  such  as  wavelets  and  fractals  are  able  to  capture  more 
subtle  spatial  variations  such  as  repetitiveness.  Local  feature  descriptors  such  as  SIFT  [74]  and 
SURF  [12]  have  also  been  widely  used,  as  well  as  the  Generalized  Balanced  Ternary  (GBT)  [44], 
a  hexagonal  mathematical  system  that  allows  feature  extraction.  A  recent  example  of  GBT’s 
usage  in  target  recognition  is  found  in  [45].  The  choice  of  candidate  features  is  usually  domain- 
specific  and  may  also  be  subject  to  scalability  constraints  in  large  scale  analysis.  The  feature 
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DigitalGlobe  Imagery 


Tasseled  cap 
transformation 

5-band  combined 


Figure  2.2:  Pre-processing  applied  to  multi-band  satellite  imagery.  Best  viewed  in  colors.  Left: 
sample  input  multi-band  image;  Right:  the  resulting  5-band  composite  image  for  which  features 
are  computed. 

extraction  procedures  applied  in  our  study  will  be  introduced  in  the  next  section. 

2.3  Proposed  Method 

In  this  section  we  discuss  algorithmic  details  of  QMAS,  starting  with  the  feature  extraction  pro¬ 
cedure  to  obtain  a  compact  yet  informative  representation  of  image  pixels.  Then  procedures  for 
mining  and  attention  routings  are  introduced,  which  include  clustering,  representative  identifica¬ 
tion  and  outlier  discovery.  The  low-labor  labeling  technique  is  presented  in  sequel. 

2.3.1  Feature  extraction 

Two  approaches  feature  extraction  were  employed  for  different  data  sets.  For  public  available 
image  collections,  we  obtained  Haar  wavelets  in  two  resolution  levels,  plus  the  mean  value  of 
each  band  of  the  images.  For  proprietary  image  collections,  we  applied  an  alternative  approach: 
first,  there  was  a  pre-processing  step  resulting  in  a  5-band  composite  image  as  illustrated  in 
Figure  2.2.  The  first  four  bands  are  the  4-band  tasseled  cap  transformation  (TCT)  of  4-band 
multi-spectral  data,  and  the  fifth  band  is  the  panchromatic  band. 

Feature  generating  following  this  second  approach  utilizes  a  variety  of  characteristics,  in¬ 
cluding  statistical  measures,  gradients,  moments,  and  textual  measures.  For  multi-scale  image 
characterization,  which  is  crucial  for  finding  patterns  at  various  resolutions,  we  adopted  General¬ 
ized  Balanced  Ternary  (GBT).  We  mapped  the  raster  pixel  data  into  the  GBT  space  and  calculated 
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Figure  2.3:  GBT  structure  illustrated.  Left:  two  levels  of  GBT  cells  with  343  pixels  (Level-3, 
outlined  in  white)  and  2,401  pixels  (Level-4,  outlined  in  red)  overlaid  on  an  image.  Right:  output 
values  were  assigned  according  to  the  variance  of  the  adjacent  lower  level  of  cells  (at  Level-2, 
consisting  of  49  pixels  each).  Bright  areas  have  greater  variance,  dark  areas  have  less. 

a  set  of  moments-based  features  over  the  multi-scale  hierarchy  of  GBT  cells.  The  GBT  structure 
is  such  that  any  cell  or  aggregate  at  a  given  layer  in  the  hierarchy  contains  seven  hexagonally 
grouped  aggregates  or  hexagons  (if  at  the  pixel  level)  in  the  layer  below  it.  The  cells  form  a 
hexagonal  tiling  of  the  pixels  at  a  variety  of  scales,  effectively  describing  the  image  in  multiple 
resolutions.  A  sample  of  GBT  structure  and  simple  computations  is  shown  in  Figure  2.3. 

Image  features  such  as  mean,  variance,  and  GBT  texture  are  calculated  for  GBT  aggregates 
in  each  of  the  five  bands  of  data.  The  final  feature  set  comprises  a  30-dimensional  feature  vector 
per  aggregate:  mean,  variance,  and  GBT  texture  of  the  Level-n  aggregate  in  each  of  the  five  data 
bands  plus  the  mean,  variance,  and  GBT  texture  of  the  Level-n  +  1  aggregate  centered  at  that 
Level-n  position  in  each  of  the  five  data  bands. 

Following  this  feature  extraction,  we  adapted  the  ViVo  method  [14]  to  group  image  tiles  into 
a  set  of  visual  terms,  with  a  slight  modification  to  incorporate  GBT  aggregate  features.  If  a  tile 
cannot  be  represented  by  the  vocabulary  already  known  to  ViVo,  then  it  will  automatically  derive 
a  new  type  of  tiles  (represented  by  a  new  visual  term),  as  needed.  These  new  types  represent 
natural  groupings  of  tiles  in  the  feature  space  and  indicate  where  new  labels  could  potentially 
improve  the  accuracy  of  QMAS. 

2.3.2  Mining  and  Attention  Routing 

Clustering 

We  start  by  clustering  image  tiles  and  subsequently  determine  representatives  and  outliers  based 
on  the  output.  The  clustering  step  over  the  set  of  images  /  is  performed  by  a  modified  version  of 
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the  MrCC  algorithm.  As  described  in  Section  2.2,  MrCC  is  a  fast  subspace  clustering  algorithm 
designed  to  look  for  clusters  in  large  collections  of  medium-dimensionality  data.  The  original 
MrCC  algorithm  is  composed  of  three  main  steps:  (i)  data  structure  construction;  (ii)  identifica¬ 
tion  of  initial  clusters,  named  /3-clusters,  which  are  axis-parallel  hyper-rectangles  with  high  data 
densities;  and  (iii)  overlapping  /3-clusters  merging.  Here  we  bypass  the  third  step  to  obtain  “soft” 
clusters,  where  a  single  tile  can  belong  to  one  or  more  clusters,  such  as  “Trees”  and  “Water”. 

Finding  Representatives 

Now  we  focus  on  the  problem  of  selecting  a  set  of  elements  R,  of  size  NR  specified  a  priori , 
to  represent  a  given  set  of  images  I .  The  set  of  representatives  R  should  have  the  following 
property:  for  every  image  I,  in  /  there  should  be  a  representative  Rr  from  R  that  is  mostly  similar 
to  R.  Assuming  that  these  images  are  already  embedded  in  a  metric  space,  a  straightforward 
optimization  goal  is  to  minimize  the  sum  of  squared  distances  between  each  image  I,  and  its 
closest  representative  Rr.  The  solution  is  simply  the  popular  clustering  algorithm  K- means. 
However,  the  method  is  known  to  be  sensitive  to  skewed  distributions,  data  imbalance,  etc,  which 
is  not  uncommon  for  our  use  case  in  studying  the  satellite  imagery.  Here  we  propose  to  optimize 
the  following  dis-similarity  function  instead: 

Eqmas(I,  R)  =  J2  ,  (2-1) 

hei  ^j= i  Wh-RjW2 

Therefore,  instead  of  focusing  on  the  minimum  of  distance  between  the  target  image  and  repre¬ 
sentatives,  here  the  harmonic  mean  is  the  concern,  which  is  usually  more  robust  to  extreme  data 
distributions  and  unfortunate  initializations.  The  solution  to  this  metric  is  known  as  K-harmonic 
means  [117]. 


Finding  Outliers 


The  goal  in  this  part  is  to  find  an  ordered  list  of  potential  outliers,  images  of  I  that  diverge  most 
from  main  data  patterns.  We  take  the  representatives  found  in  the  previous  section  as  the  basis 
for  the  outliers  definition,  i.e.,  assuming  that  a  set  of  representatives  R  is  a  good  summary  of 
/,  the  N0  images  from  /  worst  represented  by  R  are  said  to  be  the  top-iVo  outliers.  Formally, 
QMAS  uses  the  harmonic  mean  of  the  squared  distances  between  an  image  I,  and  each  one  of  the 
representatives  in  R  to  measure  the  quality  of  the  representation  of  /., .  therefore  the  top  outlier  is 
identified  by 


arg  max 

i 


nr 

SPNR  1 

^3= 1  Ilh-Rill2 


(2.2) 


2.3.3  Low-labor  Labeling 

Our  approach  is  to  represent  input  images  and  labels,  together  with  the  image  clusters  found 
before,  in  a  graph  G,  known  as  the  knowledge  graph.  A  random  walk-based  algorithm  is  applied 
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over  G  to  find  the  most  appropriate  labels  for  the  unlabeled  images.  Algorithm  1  shows  a  sketch 
of  our  solution,  and  details  are  given  in  the  remainder  of  this  subsection. 


Algorithm  1  :  QMAS  labeling. 

Input:  collection  of  images  / ; 

collection  of  known  labels  L; 
restart  probability  c; 
clustering  output  C. 

Output:  full  set  of  labels  LF. 

1:  use  I,  L  and  C  to  build  the  graphical  representation  6'; 

2:  for  each  unlabeled  image  It  E  I  do 
3:  random  walk  over  graph  G  from  vertex  V (I,),  and  with 

probability  c,  restart  the  walk  from  V  (/,;)  again; 

4:  compute  the  affinity  between  each  label  of  L  and  Ii  using  power  iteration  method  imple¬ 

menting  a  revised  random  walk  with  restart  algorithm; 

5:  let  Li  be  the  one  with  the  largest  affinity  value  in  set  in  L.  then  L F,  <—  Lf, 

6:  end  for 
7:  return  LF\ 


G  is  a  tri-parti te  graph  that  consists  of  the  vertex  set  V  and  the  edge  set  E.  V  is  made  up 
of  three  layers  corresponding  to  the  input  images  I,  the  clusters  of  images  C,  obtained  with 
algorithms  described  in  Section  2.3.2,  and  the  set  of  image  labels  L.  Vertices  in  G  that  represent 
image  I,  and  label  L/  are  denoted  by  V(I,)  and  V(Li),  respectively.  It  is  self-evident  that  with 
clustering  results  in  hand,  the  construction  process  of  such  a  graph  G  is  linear  in  time  and  space. 
Figure  2.4  exemplifies  a  toy  graph  with  seven  images,  two  distinct  labels,  and  three  clusters. 
Image  / 1  is  pre-labeled  with  L\,  while  J4  and  /7  are  both  pre-labeled  with  L2.  Note  that  under 
the  soft  clustering  scheme  we  adopted,  an  image  node  may  be  linked  with  more  than  one  nodes 
representing  clusters,  e.g.,  J3  is  connected  to  both  C\  and  C2. 

Given  an  unlabeled  image  1,,  we  apply  the  following  algorithm  over  the  graph  G  in  order 
to  find  an  affinity  score  for  each  possible  label  with  respect  to  I,;  it  is  imitating  the  well-known 
random  walk  with  restart  algorithm  with  a  minor  modification  on  selecting  the  random  neighbor 
to  land  on.  The  random  walker  starts  from  vertex  V (I,)  initially.  At  each  time  step,  the  walker 
always  takes  one  of  the  following  two  choices:  (1)  go  back  to  V  (I,.),  with  probability  c;  (2)  walks 
to  a  neighboring  vertex,  with  probability  1  —  c.  Under  the  latter  case,  the  probability  of  choosing 
a  neighboring  vertex  is  proportional  to  the  degree  of  that  vertex,  i.e.,  the  walker  favors  smaller 
clusters  and  more  specific  labels.  The  value  of  c  is  usually  set  to  an  empirical  value  (e.g.,  0.15),  or 
determined  by  cross-validation.  The  affinity  score  for  Li  with  respect  to  I,  is  given  by  the  steady 
state  probability  that  our  random  walker  will  find  himself  at  vertex  V(L{),  always  restarting  at 
V (G).  The  label  with  the  largest  score  becomes  the  recommended  label  for  /,. 

The  intuition  behind  this  procedure  is  that  similar  images  that  belong  to  the  same  cluster 
should  share  similar  labels.  This  is  consistent  with  our  graph  proximity  measure  which  favors 
multiple  short  paths  between  the  two  vertices  of  interest.  For  instance,  consider  image  J6  in 
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Figure  2.4:  The  knowledge  graph  for  a  toy  input  set.  Nodes  shaped  as  squares,  circles,  and 
triangles  represent  images,  labels,  and  clusters  respectively. 


Figure  2.4.  It  belongs  to  clusters  C2  and  C3.  The  other  two  images  in  C:i  bear  label  L2,  whereas 
none  of  the  images  in  C2  is  labeled.  Hence  there  is  a  higher  probability  that  a  random  walker 
starting  from  V (J6)  will  reach  V (L2)  than  V ( L  \ ) ,  in  that  there  are  two  shortest  paths  of  length 
3  linking  V(I6)  and  V(l2),  whereas  the  only  shortest  path  connecting  V(I6)  to  V(L1)  takes  as 
many  as  5  steps.  Moreover,  the  affinity  score  for  L2  could  be  higher  if  J6  were  associated  with 
C;>  only.  Thus,  for  larger  graphs,  in  which  it  is  not  untypical  that  an  image  belongs  to  multiple 
clusters,  the  membership  with  a  smaller  cluster  takes  more  weight  than  that  with  a  larger  one. 


2.4  Experimental  Results 

2.4.1  Experimental  Setup 

We  first  introduce  three  data  sets  of  satellite  images  that  serve  at  the  test  beds  in  this  section: 

•  GeoEye  -  14  high  quality  satellite  images  in  JPEG  format  of  a  number  of  cities  around  the 
world.  The  total  size  of  these  images  is  17  MB.  We  divided  each  image  into  equal-sized 
rectangular  tiles  and  yielded  a  total  of  14,  336  tiles.  Figure  2.1  includes  a  snapshot  of  1,  024 
tiles. 

•  SAT1.5GB  -  the  data  set  is  made  up  of  three  satellite  images  in  the  GeoTIFF  format,  each 
of  size  ~  500MB.  The  total  number  of  equal-sized  rectangular  tiles  is  721, 408. 

•  SATLARGE  -  this  proprietary  data  set  contains  a  pan  QuickBird  image  of  size  1.8  GB, 
and  its  matching  4-band  multi- spectral  image  of  size  450  MB  each.  These  images  were 
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Figure  2.5:  Graph  construction  time  versus  size  of  the  subsets  randomly  sampled  from 
SAT1.5GB.  QMAS :  red  circles;  GCap:  blue  crosses;  GCap-ANN:  green  diamonds.  Timing  re¬ 
sults  are  averaged  over  10  runs;  error  bars  are  too  tiny  to  be  discerned. 


combined  as  described  previously  in  Section  2.3.1,  and  2,570,055  hexagonal  tiles  were 
generated. 

The  experiment  is  designed  to  demonstrate  the  performance  of  QMAS  in  terms  of  computa¬ 
tional  time,  non-labor  intensive  labeling,  and  identification  of  representatives  as  well  as  outliers. 
We  also  highlight  results  from  a  set  of  query-by-excimple  tests  performed  over  the  proprietary 
data;  such  queries  exemplify  practical  retrieval  tasks  in  satellite  image  analysis.  The  baseline 
algorithm  to  be  compared  against  for  performance  evaluation  is  the  Graph-based  automatic  im¬ 
age  Captioning  method  [84],  with  two  different  approaches  of  nearest  neighbor  finding  in  the 
graph  construction:  either  the  basic  quadratic  algorithm  (GCap)  and  or  the  approximate  nearest 
neighbors  using  the  ANN  Library  (GCap-ANN).  The  number  of  nearest  neighbors  is  set  to  seven. 
All  three  approaches  share  the  same  implementation  of  random  walk  algorithms  with  the  restart 
parameter  set  to  c  =  0.15.  They  were  executed  on  a  LINUX  server  using  a  single  2.8GHz  CPU 
core  and  4GB  RAM  available. 

2.4.2  Computational  Time 

Figure  2.5  compares  the  elapsed  time  for  graph  construction  using  the  SAT1.5GB  data  set  and 
smaller  randomly  sampled  subsets.  On  the  entire  SAT1.5GB  data  set,  QMAS  is  40  times  faster 
than  GCap-ANN,  while  running  GCap  will  take  hours  (not  shown).  QMAS  scales  almost  linearly 
with  the  input  data  size,  while  the  slope  of  log-log  curves  are  2.1  and  1.5  for  GCap  and  GCap- 
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Figure  2.6:  Labeling  accuracy  versus  the  number  of  pre-labeled  examples  for  each  labeling  class. 
QMAS :  red  circles;  GCap-ANN:  green  diamonds.  Accuracy  values  of  QMAS  are  barely  affected 
by  the  size  of  the  pre-labeled  data.  Box  plots  summarize  10  runs  over  randomly  selected  inputs, 
with  outliers  (typically  over  3  standard  deviations  from  the  mean)  indicated  by  red  circles  and 
green  diamonds. 
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ANN,  respectively.  Instead  of  performing  nearest  neighbor  searches,  which  is  super-linear  even 
with  a  state-of-the-art  approximation  algorithm,  QMAS  employs  a  linear- time  clustering  algo¬ 
rithms  to  leverage  the  content  similarity  in  satellite  image  tiles,  and  achieves  superior  scalability. 


2.4.3  Non-labor  Intensive  Labeling 

We  manually  labeled  256  tiles  in  the  SAT1.5GB  data  set  as  the  ground  truth.  By  randomly  choos¬ 
ing  a  small  number  of  these  labels  as  input  and  leaving  out  remaining  ones  for  evaluation,  we 
compare  the  labeling  accuracy  of  the  three  approaches  from  10  repetitive  runs  and  display  qual¬ 
ity  results  as  box  plots  in  Figure  2.6.  QMAS  does  not  sacrifice  quality  for  faster  computational 
time  when  compared  with  GCap-ANN,  and  it  actually  performs  even  better  when  the  size  of  the 
pre-labeled  data  is  limited.  Additional  experiments  have  shown  given  10  pre-labeled  examples 
for  each  class,  even  under  the  optimal  performance- speed  trade-off  for  GCap-ANN,  where  the 
number  of  nearest  neighbors  set  to  three,  QMAS  is  still  1.75  times  faster  and  around  10%  more 
accurate.  Note  that  the  accuracy  of  QMAS  is  barely  affected  by  the  number  of  the  pre-labeled 
examples  in  each  label  class.  The  fact  that  it  can  still  extrapolate  from  tiny  sets  of  pre-labeled 
data  ensures  its  non-labor  intensive  capability. 
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Figure  2.7:  Representatives  for  the  GeoEye  dataset,  each  colored  according  to  the  cluster  mem¬ 
bership. 


Figure  2.8:  Top-3  outliers  for  the  GeoEye  dataset. 

2.4.4  Finding  Representatives  and  Outliers 

Figure  2.7  shows  data  representatives  obtained  on  the  GeoEye  data  set.  A  total  of  6  representa¬ 
tives  are  displayed,  which  are  colored  according  to  their  clusters.  Note  that  a  large  cluster,  such 
as  “Water”,  may  have  multiple  representatives. 

Figure  2.8  presents  the  top-3  outliers  on  the  same  data  set.  Closer  inspection  found  that  these 
outlier  tiles  tend  to  be  on  the  border  of  areas  like  “water”  and  “city”,  where  a  bridge  usually 
appears.  These  3  outlier  tiles,  together  with  6  representatives,  compactly  summarize  the  GeoEye 
data  set,  which  contains  more  than  14  thousand  tiles. 
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Query  results  for  most  similar  “Water”  tiles 


Figure  2.9:  Example  “Water”:  Labeled  Data  and  Results  of  Water  Query. 


Training  Set  Samples  (3  training  tiles) 


Query  results  for  most  similar  “boat”  tiles 


Query  results  broader  than  training  set 


Figure  2.10:  Example  “Boats”:  Labeled  Data  and  Results  of  Boat  Query. 
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2.4.5  Query-by-Example  Experiments 

The  query-by-example  experiments  were  carried  out  for  the  proprietary  SATLARGE  data  set  by 
domain  experts  in  satellite  image  analysis.  Given  a  small  set  of  tiles  as  examples,  they  would  like 
to  find  most  similar  tiles  over  the  entire  data  set.  To  apply  QMAS,  the  given  tiles  are  assigned  with 
a  single  label,  and  we  performed  random  walk  based  algorithms  to  find  tiles,  other  than  those 
already  given,  that  are  mostly  similar  to  this  label.  For  instance,  Figures  2.9  and  2.10  exemplify 
the  results  obtained  for  “Water”  and  “Boats”,  respectively.  We  also  varied  the  size  of  the  pre¬ 
labeled  data  in  these  experiments  between  one  and  fifty,  to  observe  how  the  system  responded 
to  these  changes.  In  general,  labeling  only  small  numbers  of  examples  (even  less  than  five)  still 
leads  to  accurate  results;  when  the  number  was  reduced  to  2,  we  began  to  see  the  negative  effects 
of  having  an  exceedingly  small  input  set. 

Notice  that  correct  returned  results  often  look  very  different  from  the  given  samples,  i.e., 
the  system  is  able  to  extrapolate  from  the  given  examples  to  other,  correct  tiles  that  do  not  have 
significant  resemblance  to  the  pre-labeled  set.  Clearly,  this  is  not  a  typical  automated  target 
recognition  (ATR)  approach.  There  are  no  “templates”  and  no  specific  object  shapes,  orienta¬ 
tions,  sizes,  or  patterns  that  are  learned.  Unlike  a  traditional  ATR  that  typically  fails  when  it 
encounters  an  object  that  does  not  fit  the  specified  description,  QMAS  is  able  to  correctly  label 
an  object  that  has  a  somewhat  different  appearance  from  the  “known”  set. 

2.5  Conclusion 

In  this  chapter  we  proposed  QMAS,  a  fast  solution  to  low-labor  labeling,  mining  and  attention 
routing  for  multi-modal  databases.  We  carried  out  experiments  in  the  scenario  of  satellite  image 
analysis  to  evaluate  its  performance.  QMAS  scales  linearly  over  the  size  of  the  data  set,  being 
multiple  times  faster  than  an  alternative  algorithm  in  graph  construction.  At  the  same  time,  it 
provides  high  quality  labeling  results,  even  with  tiny  sets  of  pre-labeled  data  as  inputs.  It  could 
also  spot  top  representatives  and  outliers  and  offered  a  compact  summarization  of  a  large  data  set. 
The  implementation  was  also  employed  to  perform  a  set  of  practical  queries  over  a  proprietary 
data  set  by  domain  experts  and  it  yielded  quite  positive  results  -  QMAS  is  able  to  correctly  label 
an  object  where  the  traditional  automated  target  recognition  (ATR)  approach  may  fail. 

Future  directions  include  leveraging  the  locality  within  images,  i.e.,  the  fact  that  image  tiles 
that  are  neighbors  are  more  likely  to  share  similar  labels,  and  generalizing  the  method  to  handle 
an  ontology  of  labels. 
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Chapter  3 


MultiAspectForensics :  Pattern  Mining  on 
Large-scale  Heterogeneous  Networks  with 
Tensor  Analysis 


Modern  applications  such  as  web  knowledge  base,  network  traffic  monitoring  and  online  social 
networks  have  made  available  an  unprecedented  amount  of  network  data  with  rich  types  of  in¬ 
teractions  carrying  multiple  attributes,  for  instance,  port  number  and  timestamp  in  the  case  of 
network  traffic.  The  design  of  algorithms  to  leverage  this  structured  relationship  with  the  power 
of  computing  to  assist  researchers  and  practitioners  for  better  understanding,  exploration  and 
navigation  of  this  space  of  information  has  become  a  challenging,  albeit  rewarding,  topic  in  so¬ 
cial  network  analysis  and  data  mining.  The  constantly  growing  scale  and  enriching  genres  of 
network  data  always  demand  higher  levels  of  efficiency,  robustness  and  generalizability  where 
existing  approaches  with  successes  on  small,  homogeneous  network  data  are  likely  to  fall  short. 

MultiAspectForensics,  introduced  in  this  chapter,  is  a  handy  tool  to  automatically  detect  and 
visualize  novel  subgraph  patterns  within  a  local  community  of  nodes  in  a  heterogeneous  network, 
such  as  a  set  of  vertices  that  form  a  dense  bipartite  graph  whose  edges  share  exactly  the  same  set 
of  attributes.  We  apply  the  proposed  method  on  three  data  sets  from  distinct  application  domains, 
present  empirical  results  and  discuss  insights  derived  from  these  patterns  discovered.  Our  algo¬ 
rithm,  built  on  scalable  tensor  analysis  procedures,  captures  spectral  properties  of  network  data 
and  reveals  informative  signals  for  subsequent  domain- specific  study  and  investigation,  such  as 
suspicious  port-scanning  activities  in  the  scenario  of  cyber-security  monitoring. 

This  chapter  will  be  structured  as  follows:  we  first  motivate  the  discussion  in  Section  3.1,  and 
then  elaborate  on  MultiAspectForensics  procedures  step-by-step  in  Section  3.2.  Empirical  results 
are  presented  in  Section  3.3.  And  related  literatures  are  briefly  sketched  in  Section  3.4.  Lastly, 
Section  3.5  concludes  the  chapter  and  highlights  future  directions.  Most  of  the  work  described 
subsequently  is  based  on  the  material  presented  in  [75]. 
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3.1  Introduction 


Modern  applications  in  the  Internet  era,  either  data-informed  or  data-driven,  have  contributed  to 
the  boom  of  network  data  arising  from  a  spectrum  of  domains,  such  as  web  knowledge  base, 
network  traffic  monitoring  and  online  social  networks.  A  glowing  trend  in  the  accumulation 
and  analysis  of  such  data  is  the  emergence  of  heterogeneous  interactions  between  nodes  in  the 
network,  for  which  a  vivid  depiction  is  offered  by  the  Facebook  friendship  page,  with  multiple 
page  elements  ranging  from  wall  posts,  comments,  and  photos,  to  mutual  friends,  shared  inter¬ 
ests  and  common  networks  between  a  pair  of  users.  Browsing  and  navigation  over  such  a  space 
of  information,  despite  its  overwhelming  scale  and  complexity,  has  been  a  challenging  task  com¬ 
mon  encountered  in  many  fields.  Yet  the  rather  recent  availability  and  popularity  of  these  data, 
in  addition  to  practical  requirements  over  the  efficiency,  robustness  and  generalizability  of  the 
solution,  has  rendered  the  topic  of  pattern  mining  for  heterogeneous  network  data  a  relatively 
underexplored  one,  where  even  the  definition  of  interesting  or  abnormal  patterns  could  become 
a  non-trivial  problem  itself. 

Many  of  pioneering  studies  on  pattern  discovery  for  graph  and  network  data  focused  on 
frequent  substructure  mining,  with  heuristics  motivated  by  information  theory  [29],  mathematical 
graph  theory  [67,  116],  inductive  logic  programming  [35],  etc.  An  intimately  related  problem 
is  the  detection  of  rare  event  and  anomalous  behavior,  which  has  attracted  wide  interests  thanks 
to  its  many  well-recognized  applications  concerned  with  security,  risk  assessment,  and  fraud 
analysis.  Noble  and  Cook  [82]  were  among  the  first  to  address  this  challenge  on  structured 
network  data  by  providing  solutions  based  on  the  minimal  description  length  principle  to  search 
for  abnormal  subgraphs.  And  many  alternative  approaches  are  now  available  to  spot  anomalous 
nodes  [6],  edges  [24],  or  both  [38],  with  further  elaboration  adapted  to  bipartite  graphs  [103], 
and  time-evolving  graphs  [109].  This  piece  of  work,  by  revealing  two  classes  of  patterns  in 
the  context  of  heterogeneous  graphs,  resembles  a  novel  attempt  to  explore  this  relatively  young 
realm  of  multi-aspect  network  data  for  state-of-the-art  discoveries  and  developments. 

We  resort  to  a  tensor-based  representation  for  heterogeneous  network  data  and  employ  off- 
the-shelf  decomposition  algorithms  [64]  as  a  starting  point  of  the  analysis.  Previous  research 
along  this  line  has  paid  a  great  deal  of  attention  on  individual  nodes,  which  play  a  central  role  in 
similarity  ranking  [41],  personalized  recommendation  [118],  etc.  The  major  finding  in  our  study 
is  that,  for  multiple  heterogeneous  network  data  across  diverse  application  domains,  we  could 
always  observe  groups  of  elements  with  similar  connections  along  one  or  more  data  modes, 
as  implied  by  nearly-identical  decomposition  scores,  which  transform  to  quite  visible  spikes  in 
histogram  plots.  While  algorithms  in  aforementioned  studies  mostly  look  for  elements  with  top 
eigenscores,  our  heuristic  distinguishes  itself  by  being  able  to  capture  patterns  formed  by  less 
well-connected  nodes  in  the  network ,  which  do  not  necessarily  stand  out  in  the  eigenspace  and 
are  often  ignored  by  other  extant  techniques. 

In  summary,  MultiAspectForensics  starts  with  a  data  decomposition  step  for  input  heteroge¬ 
neous  networks,  features  a  spike  detection  heuristic  to  reveal  non-trivial  substructure  patterns, 
and  also  includes  programs  to  automatically  visualize  the  findings.  We  demonstrate  its  effec¬ 
tiveness  and  efficiency  by  executing  MultiAspectForensics  on  three  data  sets  from  distinct  appli- 
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Figure  3.1:  Illustration  of  the  CP  decomposition:  the  input  3-mode  tensor  on  the  left  is  decom¬ 
posed  into  R  triplets  of  vectors  on  the  right,  reminiscing  of  the  rank-f?  singular  value  decom¬ 
position  of  a  matrix.  The  three  modes,  in  a  scenario  of  network  traffic  analysis,  may  represent 
source  IP  address  (red),  destination  IP  address  (blue)  and  port  number  (green),  respectively. 


cation  scenarios,  present  empirical  results  and  investigate  the  discovered  patterns,  which  could 
be  leveraged  to  suggest  suspicious  activities  from  network  traffic  logs  such  as  port-scanning  and 
denial-of-service  attack,  extract  interesting  facts  from  a  web  knowledge  base  such  as  punk  musi¬ 
cians  or  low-cost  airline  destinations,  and  report  gene  function  groups  in  a  developmental  biology 
study  consistent  with  established  theories. 


3.2  Proposed  Algorithm 

MultiAspectForensics ,  in  a  nutshell,  consists  of  the  following  steps: 

•  Data  Decomposition :  take  the  input  heterogeneous  network  as  a  tensor  and  perform  tensor 
decomposition  to  obtain  an  eigenscore  vector  along  each  data  mode. 

•  Spike  Detection  in  Histograms :  iterate  over  all  data  modes  to  obtain  histograms  and  apply 
the  spike  detection  algorithm. 

•  Substructure  Discovery :  identify  the  induced  subgraph/subtensor  for  each  spike  and  sum¬ 
marize  patterns  discovered. 

•  Visualization :  create  attribute  plots  and  histogram  plots  with  detected  spikes  highlighted. 
The  above  procedure  just  makes  use  of  the  strongest  component  after  data  decomposition.  If 
the  contribution  of  the  top  one  eigen-component  is  not  as  large,  the  latter  three  steps  should  be 
carried  out  over  multiple  strongest  components  in  a  similar  fashion.  For  brevity,  we  subsequently 
elaborate  on  three  algorithmic  steps  with  only  the  first  component  taken  into  consideration,  and 
the  visualization  step  is  illustrated  by  resulting  figures  intermixed  with  the  rest  of  the  discussion. 

3.2.1  Data  Decomposition 

We  first  introduce  a  few  definitions.  A  tensor  can  be  represented  as  a  multi-dimensional  array 
of  scalars.  Its  order  is  the  dimensionality  of  the  array,  while  each  dimension  is  known  as  one 
mode ,  of  which  the  value  ranges  over  the  set  of  elements  for  the  specific  mode.  Thus,  vectors 
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are  tensors  of  order  one,  and  matrices  are  tensors  with  two  modes.  In  Section  3.3  we  will  use 
measure  to  denote  the  unit  of  each  entry  in  the  multi-dimensional  array. 

To  transform  a  heterogeneous  network  into  a  tensor,  every  edge  becomes  a  non-zero  entry 
in  the  multi-dimensional  array,  where  edge  attributes,  together  with  edge  source  and  destination, 
make  up  different  modes  of  the  tensor.  Edge  weights  naturally  stay  as  entry  values  for  weighted 
networks.  Node  attributes  could  also  be  incorporated  by  taking  a  Cartesian  product  over  two  end 
points  of  an  edge,  for  instance,  if  a  directed  network  contains  nodes  with  7  different  colors,  we 
could  have  an  edge  attribute  whose  arity  is  72  =  49. 

Tensor  decomposition  leverages  multi-linear  algebra  to  the  analysis  of  high-order  data.  The 
canonical  polyadic  (CP)  decomposition  we  applied  in  this  chapter  generalizes  the  singular  value 
decomposition  (SVD)  for  matrices.  It  factorizes  a  tensor  to  the  weighted  sum  of  outer  products  of 
mode-specific  vectors,  as  illustrated  in  Figure  3.1  for  a  3-order  tensor.  Formally,  for  an  M -mode 
tensor  X  of  size  1\  x  J2  x  •  •  •  x  Im,  its  CP  decomposition  of  rank  R  yields 

It 

X(R, . . .  ,iM)  ~  ^  \r  (afi  x  ...  x 

r= 1  k 

R  M 

r— 1  m=  1 

Similar  to  SVD,  the  approximation  becomes  closer  as  R  enlarges,  and  would  be  exact  if  it  equals 
the  rank  of  the  tensor  (see  [53]  for  details). 

3.2.2  Spike  Detection  in  Histograms 

Now  that  we  have  transformed  complex  structured  data  into  a  set  of  more  manageable  vectors, 
the  next  step  is  to  spot  common  patterns  from  these  vectors.  As  a  starting  point,  we  visualize 
each  vector  by  creating  an  attribute  plot,  which  displays  absolute  values  of  eigenscores  (y-axis) 
along  its  elements  (indexed  by  the  x-axis).  An  example  of  such  plots  is  given  in  Figure  3.2. 
Note  that  the  y-axis  should  be  in  log  scale  to  emphasize  the  relative  difference.  The  arrow  on 
the  right  indicates  a  score  value  shared  by  many  elements,  i.e.,  a  number  of  entries  in  the  eigen¬ 
vector  have  exact  the  same  values,  which  is  not  uncharacteristic  in  other  dimensions  and  across 
different  data  sets.  This  key  observation  enables  us  to  create  effective  heuristics  to  extract  spikes 
from  histograms  and  subsequently  examine  subgraph  patterns  they  imply  in  the  next  subsection. 
And  the  fact  that  many  spikes  do  not  appear  at  the  very  top  of  the  figure  with  most  significant 
eigenscore  values  makes  it  more  difficult  for  many  alternative  methods  to  be  effective. 

Prior  to  applying  the  spike  detection  heuristics,  we  obtain  histogram  data  by  equally  dividing 
the  range  of  eigenscores  in  log  scale.  The  detection  algorithm  just  needs  to  sort  and  traverse  the 
histogram  data  until  one  of  the  following  conditions  is  satisfied:  (1)  the  energy  as  measured  by 
sum  of  square  values  covered  is  equal  or  more  than  a  fraction  of  s,  and  the  magnitude  of  the 
spike  is  less  than  a  fraction  of  r  than  the  largest  one;  (2)  there  are  already  K  spikes.  Parameter 
values  are  empirically  set  to  s  =  90%,  r  =  50%,  K  =  20,  where  small  variations  lead  to  little 
perturbation  of  the  output.  The  pseudo-code  of  the  algorithm  is  listed  in  Algorithm  2  above. 
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Figure  3.2:  An  attribute  plot  which  displays  absolute  values  of  eigenscores  (y-axis  in  log-scale) 
along  its  elements  (indexed  by  the  x-axis).  The  arrow  on  the  right  points  to  a  common  score 
value,  illustrating  an  observation  critical  to  the  algorithmic  design  of  MultiAspectForensics . 

Application  of  this  algorithm  to  the  data  vector  in  Figure  3.2  yields  Figure  3.3,  where  we  put 
attribute  plot  on  the  left  side-by-side  with  histogram  plot  on  the  right,  highlighting  every  spike  in 
red. 

3.2.3  Substructure  Discovery 

Having  extracted  sets  of  elements  that  form  histogram  spikes  from  each  data  mode,  we  head 
back  to  the  input  network  data  to  examine  corresponding  local  subnetworks  to  complete  the  final 
step  of  pattern  discovery.  The  running  example  in  this  subsection  comes  from  a  snapshot  of 
network  traffic  log  which  consists  of  packet  traces  in  an  enterprise  network  [70].  Each  trace  in 
the  log  is  a  triplet  of  ( source-IP ,  destincition-IP,  port-number),  which  could  be  represented  as  a 
directed  network  of  machine  IP  addresses  with  the  only  edge  attribute  “port  number”  and  number 
of  packets  as  edge  weights.  Patterns  derived  from  MultiAspectForensics  could  be  summarized 
into  the  following  two  categories: 

generalized  star 

A  subnetwork  which  consists  of  conterminous  edges  that  differ  only  in  one  data  mode.  For  in¬ 
stance,  a  group  of  source  IP  addresses  sending  packets  to  a  single  destination  server  using  the 
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Figure  3.3:  An  attribute  plot  (adopted  from  Figure  3.2)  on  the  left  side-by-side  with  the  corre¬ 
sponding  histogram  plot  with  spikes  detected  indicated  by  circles. 


same  port.  It  generalizes  the  star  pattern  in  two  dimensional  graphs,  and  makes  up  a  continuous 
block  along  one  dimension  in  the  adjacency  tensor,  if  elements  along  that  dimension  are  ordered 
carefully.  Note  that  in  a  heterogeneous  network,  this  category  of  patterns  also  includes  multiple 
edges  between  one  pair  of  nodes  with  differing  attribute  values,  e.g.,  a  good  many  port  numbers 
in  our  running  example,  in  which  case  the  source  machine  may  be  either  an  administrator  per¬ 
forming  port  screening  or  a  suspect  trying  to  exploit  a  vulnerable  port.  Figure  3.4  provides  an 
illustration  of  these  patterns. 

generalized  bipartite-core 

A  subnetwork  that  represents  a  dense  bipartite  structure  similar  to  the  bipartite-core  pattern  in 
regular  graphs,  and  is  akin  to  association  rules  as  well.  More  generally,  it  can  be  viewed  as  a  con¬ 
tinuous  block  along  two  dimensions  in  higher-order  tensors  under  specific  element  orders.  For 
instance,  a  group  of  source  IP  addresses  sending  packets  to  multiple  destination  servers  with  the 
same  port.  Note  that  in  a  heterogeneous  network,  this  category  of  patterns  also  includes,  written 
in  the  language  of  network  forensics,  multiple  source  IP  addresses  sending  packets  over  different 
port  numbers  to  the  same  server.  This  is  likely  to  happen  during  a  DDoS  (Distributed  Denial- 
of-Service)  attack,  a  typical  scenario  of  network  intrusion,  in  which  source  IPs  play  the  role  of 
malicious  hosts  sending  huge  volumes  of  packets  to  the  target  server  as  the  victim.  Figure  3.5 
provides  an  illustration  of  these  patterns. 

As  a  final  remark,  the  statement  that  both  patterns  are  related  to  a  block  along  one  or  two 
dimensions  in  the  high-order  tensor  only  holds  when  elements  of  their  respective  data  modes  are 
ordered  in  specific  ways.  And  the  complexity  to  search  for  such  an  order  is  generally  exponential, 
which  reflects,  in  some  sense,  the  power  of  the  proposed  approach. 
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Figure  3.4:  Examples  of  generalized  star  patterns  discovered  in  the  LBNL  (Lawrence  Berkeley 
National  Lab)  network  traffic  data  set.  Wavy  arrows  indicate  multiple  edges  between  the  pair  of 
nodes  with  a  handful  of  distinct  attribute  values,  (a)  10  source  IP  addresses  (randomly  selected 
out  of  172  ones)  are  sending  multiple  packets  to  a  server  machine  with  Port#  534,  which  is  a 
UDP  port  under  the  NCP  protocol  from  a  network  OS  for  file  sharing  and  printing  services;  (b) 
The  source  IP  registered  by  an  Indian  ISP  is  sending  packets  to  a  host  in  LBNL  via  port  numbers 
(ranging  from  2,300  to  2,900)  not  usually  intended  for  this  type  of  communication,  implying  a 
suspicious  activity. 
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Figure  3.5:  Examples  of  generalized  bipartite-core  patterns  discovered  in  the  LBNL  (Lawrence 
Berkeley  National  Lab)  network  traffic  data  set.  Wavy  arrows  indicate  multiple  edges  between 
the  pair  of  nodes  with  a  handful  of  distinct  attribute  values,  (a)  10  source  IP  addresses  (ran¬ 
domly  selected  out  of  119  ones)  are  sending  multiple  packets  to  an  array  of  server  machines, 
including  server  131.243.141.187,  which  also  appears  in  Ligure  3.4  as  part  of  a  generalized  star 
pattern,  over  a  port  used  for  file  sharing  and  printing  services;  (b)  10  source  IP  addresses  (ran¬ 
domly  selected  out  of  63  ones)  are  sending  packets  over  different  ports  to  a  multi-purpose  server 
machine. 
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Algorithm  2  SDA  (Spike  Detection  Algorithm) 

Input:  Eigenscore  histogram  vector  H  of  size  N 
Output:  The  set  indicating  spikes  detected  S 

1:  S  =  (j) 

2:  sort  the  histogram  to  obtain  an  ordered  vector  Ha  s.t.  H0l  >  II 0.2  >  ■  ■  ■  >  HON 

3:  QsUM  J2n= 1  Hn 
4:  Q  <-  0 

5:  for  k  —  1, . . . ,  K  do 
6:  S  i —  S  U  {Ofc} 

7:  Q  -t—  Q  + 

8:  if  Q/Qsum  >  S  and  H(ok)/H(oi)  <  r  then 

9:  break 

10:  end  if 

11:  end  for 
12:  return  S 


Data  set 

#  modes 

Dimensions 

Measure 

#  non-zero 

elements 

LBNL 

4 

2,345  source  IPs,  2,355 
dest  IPs,  6,055  port  #’s, 
3,610  timestamps 

#  packets 

281K 

RTW 

3 

3,641  subjects,  3,929  ob¬ 
jects,  98  verbs 

binary 

10K 

BDGP 

3 

4,491  genes,  248  terms,  6 
stages 

binary 

38K 

Table  3.1:  A  Summary  of  Data  Sets 


3.3  Empirical  Results 

We  commence  this  section  with  the  description  of  data  sets  as  well  as  experimental  environment. 

It  is  followed  by  the  discussion  of  respective  patterns  discovered  by  MultiAspectForensics  in  each 

of  the  three  data  sets. 

3.3.1  Data  and  Environment 

Data  sets  are  acquired  from  three  dissimilar  application  domains:  network  traffic  monitoring, 

knowledge  networks,  and  bioinformatics.  A  summary  is  highlighted  in  Table  3.1. 

LBNL  The  network  traffic  log  is  made  available  through  a  research  effort  to  study  the  char¬ 
acteristics  of  traffic  for  Internet  enterprises  [86].  The  measurement  was  taken  on  servers 
within  the  Lawrence  Berkeley  National  Lab  (LBNL)  from  thousands  of  internal  hosts  over 
time,  with  millions  of  packet  traces  recorded.  Each  packet  trace  includes  four  data  modes: 
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source  IP,  destination  IP,  port  number,  and  a  timestamp  in  second.  For  privacy  reasons, 
lower  16  bits  were  randomly  permuted  to  anonymize  the  host  identity,  whereas  upper  16 
bits  were  kept  intact  for  proper  identification  of  the  location  and  service  provider  [87].  We 
borrowed  a  subset  of  this  data  set  within  1-hour  time  span  in  this  section. 

RTW  This  online  knowledge  base  is  the  outcome  of  the  NELL  (Never-Ending  Language  Learn¬ 
ing)  system  at  Carnegie  Mellon  University  [20].  It  employs  natural  language  processing 
and  machine  learning  techniques  to  constantly  and  automatically  crawl  web  pages  and 
extract  facts  [21].  Each  fact  is  a  triplet  of  (subject,  verb,  object)  such  as  ( Pittsburgh ,  city- 
located-in-state,  Pennsylvania),  which  could  be  represented  as  a  directed  graph  made  up  of 
entities  like  Pittsburgh  or  Pennsylvania,  edges  with  attributes  like  city-located-in-state.  Lor 
better  quality  of  results,  we  applied  our  algorithm  on  a  preprocessed  subset  after  manual 
noise  removal  (by  courtesy  of  Bryan  Kisiel  at  Carnegie  Mellon  University). 

BDGP  The  data  set  is  collected  from  the  Berkeley  Drosophila  Genome  Project  (BDGP)  to  study 
the  spatial-temporal  patterns  of  gene  expression  during  the  early  development  of  fruit 
fly  [106,  107].  We  selected  three  data  modes  from  the  database  dump  available  at  [13], 
which  consists  of  4,491  genes,  248  functional  annotation  terms  from  a  specialized  vocab¬ 
ulary,  and  6  different  developmental  stages. 

MultiAspectForensics  was  implemented  in  the  MATLAB  language,  and  all  following  exper¬ 
iments  were  performed  on  a  Unix  machine  with  four  2.8GHz  cores,  and  16GB  memories.  Lor 
every  of  these  data  sets,  the  wall-clock  time  was  no  more  than  2  minutes  to  carry  out  the  compu¬ 
tation  and  generate  attribute  plots  and  histogram  plots  along  all  modes. 

3.3.2  LBNL  Traffic  Log 

We  have  already  discussed  patterns  discovered  from  a  snapshot  of  this  data  set  in  Section  3.2.3, 
illustrated  in  Ligures  3.4,  3.5.  With  the  additional  mode  of  timestamp,  we  found  two  dominating 
spikes  in  its  histogram  plot.  Upon  closer  examination,  we  reported  the  following  activities:  the 
first  spike  is  a  generalized  bipartite-core  pattern  related  to  the  HTTP  traffic  on  port  80  between 
four  servers  in  LBNL  and  three  remote  hosts  in  Chinese  academic  institutions,  possibly  executing 
scripts  to  crawl/download  web  pages.  The  second  spike  represents  a  generalized  star  pattern 
between  one  of  the  local  HTTP  server  and  the  same  remote  host  at  India  aforementioned.  We 
traced  further  in  time  and  found  that  the  remote  host  never  sent  packets  back  to  acknowledge  the 
connection,  suggestive  of  suspicious  activities  to  be  reported  to  domain  experts. 

3.3.3  RTW  Knowledge  Base 

Recall  that  each  item  in  the  knowledge  database  could  be  represented  as  a  (subject,  verb,  object) 
triplet.  MultiAspectForensics  detected  spikes  mostly  on  data  modes  representing  subjects  and 
objects. 

Ligure  3.6(a)  illustrates  a  subgraph  discovered  revealing  a  generalized  star  pattern.  The  mu¬ 
sic  artists/bands  listed  here  are  specialized  to  punk  music  or  its  sub-genres  (not  shown  in  the 
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Figure  3.6:  Two  generalized  star  patterns  discovered  from  the  RTW  knowledge  base:  (a)  Music 
artists  specialized  in  punk  or  one  of  its  sub-genres  according  to  the  knowledge  base;  (b)  European 
destinations  of  the  Ryanair,  an  Irish  low-cost  airline. 


figure)  according  to  the  knowledge  base,  whereas  their  more  versatile  peers  will  not  be  favorably 
selected  by  MultiAspectForensics. 

Figure  3.6(b)  displays  another  generalized  star  pattern  between  European  cities  and  an  Irish 
low-cost  airline  which  flies  to  many  regional  or  secondary  airports  to  reduce  cost,  following  a 
different  business  model  and  choice  of  destination  from  industrial  giants. 

The  evidence  here  and  many  others  alike  could  also  be  leveraged  in  a  variety  of  graph  mining 
tasks  on  this  knowledge  base  such  as  clustering  entities  or  creating  an  ontology  between  them, 
given  the  fact  that  nodes  within  the  same  spike  tend  to  behave  similarly  and  specifically.  More¬ 
over,  as  a  sanity  check,  since  node  names  are  ordered  alphabetically  in  this  data  set,  the  pattern 
does  not  make  a  continuous  block  in  the  tensor  without  non-trivial  permutation. 


3.3.4  BDGP  Gene  Annotation 

In  this  data  set  MultiAspectForensics  spots  a  set  of  genes  known  to  be  responsible  for  the  mater¬ 
nal  effect  in  the  early  development  of  fruit  fly  (Figure  3.7),  which  also  provides  hints  to  study 
other  higher  organisms  including  Homo  sapiens.  Products  of  such  maternal  effect  genes,  in  the 
form  of  either  protein  or  mRNA,  play  a  critical  role  in  the  very  early  stage  of  embryo  devel¬ 
opment,  such  as  the  first  few  cell  divisions.  For  instance,  four  of  such  genes,  including  bicoid, 
caudal,  hunchback,  and  nanos,  is  mostly  responsible  for  the  determination  of  anterior-posterior 
axis  -  which  side  of  the  embryo  will  be  the  future  head  and  which  other  side  will  be  the  future 
tail  [68]. 
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Figure  3.7:  An  attribute  plot  on  the  left  side-by-side  with  the  corresponding  histogram  plot  for 
the  “gene”  mode  of  the  BDGP  data  set.  The  largest  spike  that  appeared  at  the  bottom  is  the  set  of 
maternal  genes ,  a  special  class  of  genes  that  play  a  vital  role  in  early  embryo  development  such 
as  the  polarity  of  the  egg,  i.e.,  which  part  will  become  the  head  and  which  other  part  turns  into 
the  tail  later. 


3.4  Related  Work 


3.4.1  Anomaly  Detection 

Outlier  detection,  despite  its  wide  interest  across  many  application  domains,  is  usually  a  chal¬ 
lenging  problem,  as  reflected  in  the  fact  that  even  a  formal  definition  is  not  easy  to  make.  A 
classical  one  was  given  by  Hawkins  in  [54]:  “an  observation  that  deviates  so  much  from  other 
observations  as  to  arouse  suspicion  that  it  was  generated  by  a  different  mechanism”. 

Outlier  detection  methods  can  be  categorized  into  two  sets:  parametric,  statistical-based  ap¬ 
proaches,  and  non-parametric,  model-free  approaches.  A  common  characteristic  of  methods  in 
the  former  category  is  the  existence  of  statistical  assumptions  about  the  underlying  data  distribu¬ 
tion  [11].  The  latter  category  usually  makes  the  call  by  resorting  to  distance  computation  [63] 
or  density  estimation  [19,  58].  Besides,  projection-based  methods  [2]  have  been  introduced  for 
high-dimensional  data.  Moreover,  clustering  algorithms  may  output  outlier  labels  as  a  by-product 
(e.g.,  [26]). 

Compared  to  outlier  detection,  anomaly  detection  in  structured  data  has  only  gained  recent 
attention  [25],  where  we  have  reviewed  relevant  studies  in  the  introductory  section  and  claimed 
that  there  is  no  other  attempt,  to  the  extent  of  our  knowledge,  to  discover  similar  patterns  in 
heterogeneous  network  data  as  MultiAspectForensics. 
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3.4.2  Tensor  Analysis  and  Graph  Mining 


Tensor  decomposition  has  been  a  basic  technique  well  studied  and  applied  to  a  wide  range  of  dis¬ 
ciplines  and  scenarios.  An  informative  survey  on  tensor  decompositions  is  presented  by  Kolda 
and  Bader  [64]  with  many  further  references.  Recent  researches  have  further  generalized  the 
CP  decomposition  to  handle  incomplete  data  [1],  or  to  produce  non-negative  components  [97]. 
Tucker  decomposition,  as  the  other  well-known  approach,  is  more  flexible,  although  its  applica¬ 
tion  is  usually  limited  by  its  limited  scalability  and  vulnerability  to  noise.  Notably,  recent  work 
on  scalable  alternatives  such  as  [111]  may  open  up  the  venue  to  enhance  the  MultiAspectForen¬ 
sics  methodology  with  more  powerful  decomposition  algorithms. 

Quite  a  few  popular  implementations  of  tensor  decomposition  algorithms  for  academic  re¬ 
searchers  have  been  made  publicly  available.  Examples  are  the  N-way  toolbox  by  Andersson 
and  Bro  [7]  and  the  more  recent  MATLAB  Tensor  Toolbox  by  Bader  and  Kolda  [9]. 

Tensor  analysis  has  also  been  applied  to  study  the  dynamics  of  graphs  and  networks  [104]. 
They  commonly  start  by  analyzing  graph/tensor  snapshots  within  each  timestamp,  and  take  the 
output  for  subsequent  time-series  analysis.  MultiAspectForensics,  instead  of  focusing  on  the 
evolution  between  adjacent  timestamps,  treats  timestamp  as  another  data  mode  to  allow  better 
discovery  of  global  patterns  in  this  trade-off. 


3.5  Conclusion 

We  presented  MultiAspectForensics,  a  handy  and  effective  tool  to  automatically  detect  and  visu¬ 
alize  a  category  of  novel  patterns,  including  generalized  star  and  generalized  bipartite-core  pat¬ 
terns,  within  a  local  community  of  nodes  in  heterogeneous  networks,  even  if  they  exist  among 
less-well  connected  nodes  which  are  more  likely  to  be  ignored  by  many  extant  methods.  Empir¬ 
ical  results  exhibited  valuable  insights  derived  from  pattern  discovered,  across  multiple  applica¬ 
tion  domains  such  as  network  traffic  monitoring,  knowledge  networks,  and  bioinformatics.  These 
successes  could  be  attributed  to  the  fact  that  we  resorted  to  a  tensor-based  representation  to  facil¬ 
itate  data  decomposition,  reached  a  key  observation  leading  to  spike  patterns  in  histogram  plots, 
and  revealed  typical  substructures  reflecting  spectral  properties  of  heterogeneous  data.  Hence 
MultiAspectForensics  realizes  an  early  attempt  to  research  substructure  patterns  commonly  ex¬ 
isting  in  heterogeneous  network  data,  and  a  reasonable  use  case  of  tensor  analysis,  despite  the 
simplicity  of  heuristics  resided. 

An  important  problem  beyond  the  scope  of  this  manuscript  is  the  design  of  an  objective  and 
quantitative  evaluation  framework  of  discovered  patterns,  especially  for  large-scale  networks  for 
which  it  is  prohibitive  to  label  every  interesting  pattern.  Although  it  may  be  relatively  to  define 
precision  and  recall  by  exhaustively  searching  for  subgraphs  bearing  the  specified  pattern  such  as 
generalized  star  or  generalized  bipartite  core,  the  definition  of  quality,  or  value  of  these  patterns 
from  automated  discovery,  is  usually  domain  and  context  specific  -  even  may  not  be  losslessly 
quantified.  This  would  also  shed  lights  on  a  principle  way  of  optimizing  parameters,  yet  we  found 
that  results  were  usually  not  sensitive  to  parameter  values  when  they  vary  within  reasonable 
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ranges.  Meanwhile,  it’s  our  plan  to  open-source  the  MultiAspectForensics  tool  based  on  the 
generic  boost  graph  library  [99]  to  make  it  more  accessible  and  usable  by  industrial  practitioners 
and  academic  researchers,  and  collect  feedbacks  for  possible  future  developments. 


38 


Part  III 

Querying  Multimedia  Data 
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Chapter  4 

CDEM :  Flexible  Querying  System  for 
Biological  Image  Databases 


Given  a  large  collection  of  images  documenting  the  spatial  patterns  of  gene  activities  on  football¬ 
shaped  embryos,  can  we  perform  content-based  retrieval  to  find  similar  patterns  for  an  existing 
or  new  input?  Can  we  leverage  this  similarity  to  group  together  genes  that  display  correlated 
patterns,  which  suggests  a  greater  likelihood  for  them  to  participate  in  the  same  biological  pro¬ 
cess  in  the  development  of  the  embryo?  Can  we  construct  a  network  of  genes  to  visualize  such 
relationships  known  as  co-expression?  Moreover,  most  of  the  genes  bear  a  handful  of  labels, 
manually  curated  by  domain  experts  using  anatomical  terms  to  indicate  the  body -part  (e.g.,  “em¬ 
bryonic  hindguf ’)  with  most  significant  expression  activity,  can  we  employ  this  additional  se¬ 
mantic  information  to  improve  the  retrieval  results,  or  automatically  assign  such  labels  to  new 
and  upcoming  images? 

This  chapter  and  the  accompanying  online  query  interface  is  an  initial  attempt  to  address 
these  questions.  Part  of  the  work  described  subsequently  is  based  on  the  material  presented  in 
[48], 

4.1  Background 

How  an  organism  develops  from  a  single  cell,  one  of  the  great  mysteries  of  life,  has  always 
been  an  intriguing  problem  for  biologists.  To  uncover  the  genetic  foundation  of  animal  design, 
extensive  in  vitro  and  in  vivo  studies  have  been  carried  out  to  decode  the  early  development 
of  Drosophila  melanogaster ,  or  fruit  fly,  with  the  expectation  that  understanding  gained  in  this 
organism  model  may  apply  to  other  species,  not  excluding  human  beings,  as  well.  Thanks  to 
the  recent  advancement  in  biomedical  imaging  technologies,  it  is  now  possible  to  make  high- 
resolution  image  recording  of  2D  or  even  3D  spatial  signals  capturing  gene  expression  in  cells 
and  living  organisms.  As  a  popular  type  of  data  characterizing  complex  biological  systems, 
many  of  these  images  are  digitally  stored  into  multimedia  databases  and  made  publicly  accessible 
via  various  online  and  offline  interfaces  for  querying  and  browsing.  For  instance,  the  Berkeley 
Drosophila  Genome  Project  (BDGP)  [106,  107]  provides  an  online  database  of  two-dimensional 
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Gene 

CG  15141 

Annotation  Terms 

brain,  central  nervous  system, 

glia,  neurons,  ventral  nerve  cords 

Developmental  Stage 

13-16 

Figure  4.1:  A  fruit  fly  embryo  image  with  all  its  attributes  sampled  from  the  BDGP  database. 
The  original  filename  is  insitu65  954  .  jpe. 


fruit  fly  embryo  images,  which  includes  three  modes  of  data:  more  than  70,000  images  of  size 
up  to  1520  by  1080  pixels,  around  3,000  genes,  and  several  hundred  annotation  terms.  Figure  4.1 
illustrates  one  image  from  the  database  together  with  all  its  attributes. 

Every  image  is  associated  with  a  single  gene,  of  which  the  expression  is  documented  digi¬ 
tally.  There  are  up  to  a  few  annotation  terms  from  a  controlled  vocabulary  as  a  textual  description 
of  the  pattern,  indicating  parts  of  the  body  that  have  darker  colors  and  more  significant  expres¬ 
sions.  Image-based  analysis  is  still  indispensable  since  subtle  patterns  may  not  be  captured  by 
the  vocabulary  of  limited  size.  Also,  each  image  has  a  time  stamp  which  is  labeled  using  one 
of  the  six  predefined  developmental  stage  ranges.  Patterns  under  different  stages  are  not  directly 
comparable  due  to  the  drastic  morphological  changes  in  the  developmental  process.  This  also 
leads  to  the  difference  in  the  set  of  annotation  terms  eligible  for  each  stage  ranges. 

We  present  a  general  framework  in  this  chapter  on  which  a  number  of  interesting  tasks  around 
this  multi-modal  data  collection  of  genes,  image  patterns,  and  text  annotations.  For  example, 

•  Multi-modal  Retrieval :  given  one  or  more  images/genes/terms,  what  are  the  most  relevant 
images/genes/terms?  This  could  assist  biologists  to  find  similarities  with  each  data  modal 
or  perform  cross-modal  queries  such  as  annotation  term  suggestions  for  images. 

•  Multi-modal  Clustering :  find  groups  of  images/genes/terms  that  members  are  more  closely 
linked  to  each  other  than  with  non-members.  The  hope  is  that  each  group  may  correspond 
to  one  or  more  biological  functions  so  that  subsequent  biological  analysis  could  be  more 
focused  on  a  smaller  set  of  genes. 

•  Network  Construction :  draft  a  network  between  a  subset  of  genes  where  links  between 
genes  represent  spatial  co-expression  and  may  provide  hint  on  meaningful  biological  in¬ 
teractions. 

The  basic  idea  of  our  approach  is  constructing  a  graph  of  multiple  types  of  nodes  represent¬ 
ing  different  mode  of  data.  Graph-based  algorithms  such  as  random  walk  with  restart  could  be 
adapted,  and  the  infrastructure  and  the  algorithm  employed  are  readily  scalable  to  handle  a  rela¬ 
tively  large  volume  of  data.  An  online  interface  is  made  available  accordingly  to  assist  biologists 
to  browse  and  navigate  the  data  set. 
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insitu8820 


Figure  4.2:  A  tri-partite  graph  constructed  from  the  BDGP  database. 


4.2  Proposed  Method 

4.2.1  Graph  Construction 

As  a  prerequisite  for  all  the  subsequent  algorithmic  development,  we  construct  a  heterogeneous 
graph  which  consists  of  multiple  layers  of  nodes,  each  of  which  represents  one  of  the  data  modes 
(genes,  images,  or  annotation  terms).  The  connection  between  different  data  modes  are  ab¬ 
stracted  as  edges  between  corresponding  nodes.  For  instance,  as  illustrated  in  Figure  4.2,  gene 
node  CG32369  is  connected  to  an  image  node  insitu8820  and  a  term  node  embryonic 
midgut.  This  indicates  that  the  image  with  the  label  insitu8  82  0  documents  the  expression 
pattern  for  gene  CG32 3  6 9.  And  such  spatial  pattern,  as  well  as  those  recorded  in  other  images 
of  the  gene  in  the  same  developmental  stage,  could  be  (partially)  located  at  the  embryonic 
midgut,  i. e. ,  the  part  of  embryo  from  which  most  of  the  intestines  are  derived.  Note  that  an¬ 
notation  nodes  are  connected  nodes  representing  genes  rather  than  image  nodes  because  based 
on  the  data  source  available,  given  a  particular  developmental  stage,  images  from  the  same  gene 
always  share  the  same  set  of  annotation  terms. 

However,  an  important  part  of  information  is  ignored  by  the  aforementioned  tripartite  graph 
-  content  similarity  between  images,  which  is  the  essential  pattern  we  would  like  to  capture  in 
the  image-based  analysis.  Consequently,  we  propose  to  obtain  a  feature -based  representation 
of  expression  images,  and  connect  pairs  of  images  that  are  close  to  each  other  in  the  feature 
space.  Feature  values  are  borrowed  from  the  triangulated  images  in  [42],  which  employed  a 
number  of  image  processing  techniques  to  align  embryos  of  different  sizes,  shapes,  positions 
and  orientations,  as  well  as  to  remove  certain  imaging  artifacts.  And  we  find  3 ~  1 0  nearest 
neighbors  for  each  image  node  to  create  intra-layer  edges. 

4.2.2  Multi-Modal  Retrieval 

With  the  complete  graph  of  images,  genes  and  terms  as  well  as  links  between  them,  we  are  ready 
to  derive  the  proximity  measure  for  the  retrieval  task.  Here  we  employ  random  walk  with  restart 
(RWR)  on  graphs,  which  is  also  known  as  personalized  Pagerank  [83].  Given  a  query  node  i,  the 
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Stages  Included 

13-16 

11-16 

9-16 

7-16 

4-16 

1-16 

#  of  Nodes 

10,868 

23,008 

28,568 

34,141 

42,319 

49,261 

#  of  Edges 

99,113 

199,035 

237,555 

274,415 

336,354 

385,515 

Avg  Node  Degree 

9.2 

8.7 

8.3 

8.0 

7.9 

7.8 

Table  4.1:  A  summary  of  graphs  constructed  of  different  sizes. 


proximity  from  i  to  every  other  node  in  the  graph  can  be  derived  from  the  steady- state  probability 
of  the  following  discrete-time  Markov  process:  at  each  time  tick,  the  random  walker  makes  one 
of  two  possible  choices: 

1.  with  probability  (1  —  c),  randomly  pick  one  of  the  neighbors  of  the  current  node,  and  walk 
to  that  node, 

2.  with  probability  c,  jumps  back  to  the  query  node  i, 

where  c  is  known  as  the  restart  probability  and  empirically  set  to  0.1.  If  the  graph  has  weighted 
edges,  the  chance  of  picking  a  random  neighbor  under  the  first  choice  is  proportional  to  the 
weight  of  the  connecting  link.  Denote  the  corresponding  steady-state  probability  vector  by  ri, 
and  the  graph  adjacency  matrix  by  IT",  then  we  have 

Fi  =  (1  —  c)Wr{  +  cei  (4.1) 

where  Wjk  equals  the  probability  of  going  from  node  k  to  node  j  under  the  first  choice,  and  ei 
is  a  vector  of  which  the  zth  element  is  1  and  other  ones  are  0.  Eq.  4.1  can  be  solved  directly  to 
obtain  r;  =  c(J  —  (1  —  c)W)~1)ei.  However,  the  cost  matrix  inversion  would  be  prohibitive  for 
a  moderately  large  W.  An  iterative  power  method  is  applied  instead,  of  which  the  complexity  is 
linear  to  the  number  of  edges  for  a  sparse  graph.  More  elaborate  techniques  such  as  [108]  could 
be  employed  if  scalability  is  concerned. 

The  RWR  algorithm  applies  naturally  to  the  multi-modal  query  setting,  as  relevance  scores 
could  be  normalized  within  each  layer  of  node  respectively.  It  also  generalizes  smoothly  to 
multiple  query  inputs,  by  letting  e;  have  multiple  non-zero  entries,  each  of  which  corresponds  to 
one  of  the  candidate  nodes  under  the  random  walker’s  second  choice. 


4.3  Experimental  Evaluation 

To  shed  light  on  the  scalability  of  the  proposed  approach,  we  create  a  total  of  6  graphs  of  varying 
sizes  by  putting  in  additional  data  from  other  stage  ranges.  Statistics  are  summarized  in  Table  4. 1 . 
The  database  dump  and  feature  representation  of  images  were  the  same  as  [42]  and  downloaded 
from  the  BDGP  website.  Images  are  linked  to  each  other  in  the  feature  space  if  their  feature 
vectors  have  a  correlation  value  greater  than  0.7.  The  number  of  nearest  neighbors  linked  to  each 
image  is  constrained  to  be  between  3  and  10. 

We  measure  the  elapsed  time  of  the  graph  construction  algorithm  on  a  Linux  machine  with 
2.8GHz  cores  with  MATLAB  2009b  installed.  Figure  4.3(a)  plots  the  average  number  over  10 
repetitions  against  the  number  of  nodes  in  the  graph,  whereas  Figure  4.3(b)  reports  the  running 
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Figure  4.3:  Running  time  based  on  different  developmental  stage  ranges  (a)  for  constructing 
the  graphical  representation,  averaged  over  10  repetitions;  and  (b)  for  one  query  using  RWR, 
averaged  over  100  random  queries,  with  error  bars  of  one  standard  deviation. 


Rank 

Image  Results 

Genes  (Synonyms) 

Annotation  Terms 

1 

insitu67039  (Fig.  4.4(a)) 

CGI 0498  (cdc2c) 

ventral  nerve  cord 

2 

insitu64954  (Fig.  4.4(b)) 

CG15141 

brain 

3 

insitu28800  (Fig.  4.4(c)) 

CG5581  (Ote) 

central  nervous  system 

4 

insitu67041  (Fig.  4.4(d)) 

CG10212  (SMC2) 

midgut 

5 

insitu35317  (Fig.  4.4(e)) 

CGI 245  (MED27) 

neurons 

Table  4.2:  C-DEM  query  results  using  the  query  image  shown  in  Figure  4.4(a). 


time  for  executing  query  from  a  random  node  of  graphs  constructed.  Both  curves  show  a  lin¬ 
ear  trend  as  graph  sizes  go  up,  and  for  the  largest  graph  containing  almost  50,000  nodes,  the 
time  needed  for  graph  construction  and  querying  are  no  more  than  5  minutes  and  0.5  seconds, 
respectively. 

Table  4.2  and  Figure  4.4  provide  top  five  query  results  for  each  modal  using  an  image  with 
visible  expression  patterns  in  the  anterior  (near  the  head)  and  ventral  (near  the  belly)  part  of 
the  embryo  as  the  query  input.  All  five  images  display  similar  patterns  to  the  query.  The  cor¬ 
responding  gene  for  each  of  these  images  is  also  in  the  list  of  most  relevant  genes.  Three  of 
the  top  five  genes  (cdc2c,  Ote,  SMC2)  have  been  identified  as  mitotic  genes,  which  is  related 
to  the  mitosis  (M)  phase  in  the  cell-division  cycle,  in  an  independent  study  [102].  Relevance 
scores  for  top  annotation  terms  are  0.15,  0.15,  0.12,  0.07  and  0.04,  respectively.  Top  two  terms 
(ventral  nerve  cord  and  brain)  are  part  of  the  annotation  of  every  image  in  the  top-five 
list,  whereas  the  third  term  (central  nervous  system)  is  shared  by  all  the  images  but 
insitu28800.  This  may  lead  to  a  specific  annotation  suggestion  of  central  nervous 
system  for  the  image  insitu2880.  And  we  would  like  to  automate  this  task  in  our  future 
work. 
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(a)  Graph  Construction 


(b)  Query  using  RWR 
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(a)  insitu67039  (d)  insitu67041  (e)  insitu35317 


Figure  4.4:  A  typical  query  result  using  an  embryo  image  in  (a)  as  the  query  input.  Top  4  similar 
images  other  than  the  query  image  itself  are  displayed  in  (b)-(e). 


(a)  Online  Interface 


HTTP 


RMI 


(b)  System  Architecture 


Figure  4.5:  C-DEM:  an  online,  multi-modal  query  system  for  Drosophila  embryo  databases. 
Images  are  adapted  from  [48]. 


Figure  4.5(a)  illustrates  the  online  interface  of  the  C-DEM  query  system.  The  query  input 
could  be  an  image,  and/or  a  gene,  and/or  an  annotation  term.  The  software  architecture  of  C- 
DEM  in  Figure  4.5(b)  de-associates  the  front-end  web  server  and  back-end  computing  engine 
with  a  clear  and  stable  API.  They  are  deployed  on  separate  machines  for  better  performance  by 
distributing  the  workload.  Detailed  discussion  on  how  to  query  and  browse  the  database  using 
C-DEM  is  given  in  [48]. 
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4.4 


Related  Work 


4.4.1  Automatic  Analysis  of  Embryo  Images 

A  first  analysis  of  the  data  set  came  from  [106,  107]  by  the  BDGP  group:  [106]  performed 
co-clustering  of  the  gene-annotation  matrix;  [107]  incorporated  the  microarray  data  for  gene 
clustering  and  applied  a  fuzzy  clustering  algorithm  which  allowed  a  gene  to  belong  to  multiple 
clusters.  Both  of  them  only  made  indirect  use  of  image  data  through  the  manual  annotation 
results.  Moreover,  [107]  provided  a  network  representation  of  collapsed  annotation  terms  which 
reflect  tissue  relatedness. 

[119]  developed  algorithms  to  determine  the  stage  of  an  image  and  perform  automatic  anno¬ 
tation.  Since  an  image  may  have  multiple  annotation  terms,  annotation  was  treated  as  a  series  of 
simple  bi-class  classification  problems.  Image  features  were  obtained  using  2D  wavelet  discrete 
transform  and  a  feature  selection  algorithm  from  previous  work  [89].  The  same  author  introduced 
in  [90]  additional  features  based  on  Gaussian  mixture  model  (GMM)  and  principle  component 
analysis  (PCA)  to  characterize  local  and  global  patterns.  Their  proposed  algorithm  performed 
very  well  in  the  task  of  automatically  finding  the  developmental  stage  given  an  expression  image 
(>99%  accuracy),  however,  automatic  annotation  turned  out  to  be  a  much  more  challenging  task, 
and  even  finding  an  intuitive,  objective  evaluation  scheme  seems  to  be  nontrivial. 

[52]  argued  that  the  pure  visual  feature  based  retrieval  method  cannot  be  applied  to  find  cor¬ 
respondence  between  images  in  different  developmental  stages.  The  correspondence  should  be 
established  through  the  annotation  terms  which  are  from  the  same  controlled  vocabulary  inde¬ 
pendent  of  developmental  stages.  A  max-margin  based  algorithm  designed  for  multi-modal  data 
mining  was  applied  to  obtain  empirical  evaluation  on  the  BDGP  database.  The  algorithm  outper¬ 
formed  another  state-of-the-art  multi-modal  mining  algorithm  in  precision-recall  for  most  of  the 
retrieval  tasks  evaluated,  and  scaled  well  with  the  size  of  the  dataset.  However,  more  biological 
evidence  need  to  be  provided  to  support  this  special  treatment  of  “across-stage  retrieval  ”,  and 
this  framework  may  limit  some  global  image  features. 

The  FEMine  system  [85]  derived  two  sets  of  features  by  applying  PCA  and  ICA  (Independent 
Component  Analysis)  respectively,  and  performed  classification,  clustering  and  retrieval  tasks 
based  on  these  features.  [85]  provided  a  detailed  discussion  on  image  preprocessing,  which  will 
be  adapted  as  an  important  component  in  the  current  project.  The  major  concern  comes  from  the 
experimental  evaluation  where  images  were  hand-picked  from  the  BDGP  dataset  and  the  size  of 
the  experimental  data  need  to  be  significantly. 

Random  walk  and  related  methods  have  many  successful  applications,  of  which  the  most 
well  known  one  is  PageRank  [83].  [84]  applied  random  walk  with  restart  to  a  simple  automatic 
caption  setting. 

4.4.2  Online  Databases 

The  database  created  by  BDGP  provides  a  query  interface  where  users  provide  gene  name,  devel¬ 
opmental  stage,  and/or  annotation  information  and  search  engine  returns  corresponding  images. 
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The  FlyExpress  database  [65]  offers  a  content  based  image  retrieval  function  named  “find 
similar  patterns”,  where  users  first  choose  an  extracted  pattern  of  an  image  from  a  small  set 
of  candidates,  then  the  search  engine  returns  a  list  of  images  with  similarity  score.  Both  web 
sites  refer/link  to  the  well  known  Flybase  [112]  for  detailed  information  about  gene  function, 
synonym  etc. 


4.5  Conclusion 

We  presented  an  online  interface  to  assist  biological  researchers  to  perform  flexible  querying  and 
exploration  over  a  large  database  which  consists  of  embryo  images,  image  annotations,  as  well 
as  genes  whose  expression  patterns  are  illustrated  by  these  images.  Given  an  input  query  from 
any  data  modal,  image  or  text,  the  system  could  automatically  and  efficiently  search  over  the 
entire  database  and  output  an  ordered  list  of  most  similar  images  or  formatted  attributes.  The 
underlying  proximity  measure  is  derived  via  random  walk  with  restart  algorithm  over  graphs. 
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Chapter  5 

BEFH:  Bayesian  Exponential  Family 
Harmoniums  for  Multimedia  Retrieval 


The  vast  size  of  the  text  and  multimedia  information  available  from  digital  libraries  and  World 
Wide  Web,  and  large  amount  of  knowledge  contained  therein,  creates  a  need  to  organize  and 
summarize  topical  contents  of  these  data.  In  recent  years,  there  is  a  growing  volume  of  research 
on  applying  probabilistic  graphical  models  to  develop  automatic  information  distillation  systems 
that  can  explore  and  exploit  real-world  data  from  diverse  sources,  such  as  texts,  images  and  bio¬ 
logical  sequences.  This  chapter  presents  a  Bayesian  approach  to  inference  and  learning  with  the 
recently  proposed  exponential  family  harmonium  (EFH)  models  and  their  variants  for  posterior 
latent  semantic  projection  of  multimedia  documents  for  subsequent  data  mining  tasks  such  as 
classification  and  retrieval. 

We  first  provide  a  table  listing  common  acronyms  in  this  chapter  for  reference. 


Table  5.1:  Summary  of  Acronyms 


Acronym 

Explanation 

EFH 

BEFH 

GB-EFH 

DWH 

LSI 

pLSI 

LDA 

Exponential  Family  Harmonium,  a  family  of  undirected  graphical  model 

Bayesian  extension  of  EFH 

A  special  case  of  EFH  with  variable  distributed  in  binary /Gaussian 

Dual-Wing  Harmonium,  a  generalization  of  EFH  for  multi-modal  data 

Latent  Semantic  Indexing,  a  classical  method  for  topic  discovery 

Probabilistic  Latent  Semantic  Indexing,  a  classic  probabilistic  topic  model 

Latent  Dirichlet  Allocation,  a  generative  Bayesian  graphical  model  for  topic  discovery 

5.1  Introduction 

Probabilistic  graphical  models  provide  a  compact  description  of  complex  stochastic  relation¬ 
ships  among  random  variables,  which  can  correspond  to  both  perceivable  entities  (e.g.,  words, 
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imageries)  and  abstract  concepts  (e.g.,  topics,  themes).  Such  a  formalism  often  facilitates  flex¬ 
ible  statistical  reasoning  and  query  answering  based  on  appropriate  computational  algorithms. 
Inspired  by  the  classical  approach  of  latent  semantic  indexing  [34],  at  the  beginning  of  this  cen¬ 
tury  there  have  been  important  advances  in  developing  latent  semantic  graphical  models  for  large 
text  corpora  and  multimedia  data,  based  on  either  a  Bayesian  network  or  a  Markov  random  field 
(MRF)  formalism.  For  instance,  the  probabilistic  latent  semantic  indexing  (pLSI)  [56]  method 
models  each  document  as  an  admixture  of  topic-specific  distributions  of  words.  The  more  recent 
latent  Dirichlet  allocation  (LDA)  technique  [18]  employs  a  hierarchical  Bayesian  extension  of 
pLSI,  treating  both  the  document-specific  topic-mixing  coefficients  and  the  topic-specific  word 
probabilities  as  random  variables,  under  appropriate  conjugate  priors.  LDA  can  be  extended  to 
multimedia  collections  by  assuming  that  the  unobserved  “topics”  are  correlated  with  both  im¬ 
age  variables  and  word  variables  [10,  16].  Recently,  Welling  et  al.  [113]  proposed  another  class 
of  latent  semantic  graphical  models  known  as  the  exponential  family  harmonium  model  (EFH), 
which  can  be  understood  as  an  undirected,  and  non-Bayesian  counterpart  of  the  LDA  model. 
Subsequently,  [114]  extended  EFH  to  a  dual-wing  harmonium  model  (DWH)  for  joint  modeling 
of  text  and  image.  Also,  Gehler  et  al.  [43]  proposed  the  rate  adapting  Poisson  (RAP)  model 
which  follows  the  general  architecture  of  EFH  model  and  use  conditional  Poisson  distributions 
to  model  observed  count  data.  And  McCallum  et  al.  [78]  proposed  a  training  criterion  called 
multiple-conditional  learning  (MCL)  for  MRFs  and  EFHs.  Unlike  the  directed  graphical  mod¬ 
els  such  as  pLSI  and  LDA,  EFH  does  not  employ  auxiliary  latent  variables  (i.e.,  the  imaginary 
topic  indicators  for  every  word)  to  facilitate  topic  mixing  and  simulate  data  generation;  and  it 
allows  a  more  flexible  representation  of  the  latent  topic  aspects  for  documents  (i.e.,  as  a  point  is 
a  Euclidean  space  rather  than  in  a  simplex). 

An  important  advantage  of  the  directed  latent-topic  models  such  as  LDA  is  that  they  can 
be  straightforwardly  embedded  in  a  Bayesian  framework,  and  can  undergo  Bayesian  training, 
smoothing  and  inference.  To  date,  the  MRF-based  models  such  as  EFH  and  DWH  have  been 
largely  limited  to  a  maximum  likelihood  (ML)  learning  framework,  which  is  prone  to  undesir¬ 
able  effects  such  as  overfitting  over  a  relatively  smaller  data  set,  high  variance  in  sampling-based 
inference  and  parameter  estimation,  and  indifference  to  prior  knowledge.  These  limitations  re¬ 
strict  their  utilities  in  many  realistic  data  mining  scenarios  where  data  are  sparse  and  spurious. 
The  ML  framework  also  makes  it  difficult  to  fully  exploit  the  modeling  power  of  MRF  in  latent 
topic  distillations  and  to  develop  future  extensions.  The  unavailability  of  a  Bayesian  version 
of  EFH  is  partly  due  to  the  remarkable  technical  difficulties  one  must  overcome  when  working 
under  such  a  formalism.  It  is  well-known  that  statistical  learning  of  EFH  models  from  data, 
even  under  an  ML  framework,  is  technically  non-trivial.  As  discussed  in  [81]  and  [91],  Bayesian 
learning  for  general  MRF,  is  even  more  challenging,  particularly  in  cases  that  involve  latent  vari¬ 
ables  as  in  EFH.  In  this  chapter,  we  attempt  to  address  some  of  this  challenges:  endowing  EFH 
with  a  simple  Bayesian  prior,  and  presenting  a  sampling-based  algorithm  for  Bayesian  inference 
and  learning. 

In  summary,  we  present  Bayesian  EFH  (BEFH),  in  which  a  multivariate  Gaussian  prior  is 
introduced  for  the  weight  matrix  that  couples  the  latent  topics  with  observed  attributes  in  EFH 
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(and  also  in  DWH).  As  detailed  subsequently,  it  is  illuminative  to  view  the  weight  matrix  of  EFH 
as  the  matrix  of  word  probabilities  under  all  topics  in  LDA.  Under  this  analogy,  our  prior  corre¬ 
sponds  to  the  Dirichlet  priors  for  the  word  probabilities  in  LDA.  It  is  well-known  that  methods 
for  Bayesian  inference  and  learning  in  directed  graphical  models  such  as  LDA  does  not  apply  to 
the  undirected  graphical  models  concerned  here,  because  of  the  intractability  and  non-conjugacy 
arising  from  the  generally  intractable  partition  function.  In  this  chapter,  we  present  the  Langevin 
algorithm  conjoint  with  a  MCMC  sampling  scheme  for  posterior  inference  under  BELH.  We  also 
propose  an  empirical  Bayes  method  based  on  the  Langevin  algorithm  for  unsupervised  estima¬ 
tion  of  the  BEFH  hyper-parameter  given  training  data.  Finally  we  show  comparisons  of  ML  and 
Bayesian  approaches  on  a  synthetic  dataset  with  known  parameters  and  a  dataset  provided  by 
TRECVID  2003  [101]  with  both  text  and  image  data. 


5.2  Bayesian  EFH 

In  this  section,  we  outline  the  basic  structure  of  a  Bayesian  EFH  in  the  context  of  a  simple 
instantiation  of  EFH  for  latent  topic  modeling  of  text  corpora. 

Prior  to  delving  into  technical  discussion,  we  provide  a  summarization  of  symbols  to  be 
referred  multiple  times  in  Table  5.2. 

We  commence  the  discussion  with  a  brief  recap  of  the  basic  EFH,  as  described  in  [113]. 
Consider  an  undirected  graphical  model  defined  on  a  complete  bipartite  graph  containing  two 
layers  of  nodes  (Fig  5.1).  Let  H  =  {Hj}  denote  the  set  of  hidden  units  in  such  a  graph,  and  let 
X  =  {Xi}  denote  the  set  of  input  units.  An  EFH  defines  the  following  Markov  random  field: 

p(x,  h)  OC  exp  |  Y  °iafia(Xi )  +  Y  ^jbgjb(hj)  +  Y  Wiafia(Xi)gjb(hj )},  (5.1) 

ia  jb  ijab 

where  {/ia(-)  ■  Va}  denotes  the  set  of  potential  functions  (or  features)  defined  on  each  of  the 
input  units  (indexed  by  i )  in  the  model,  and  likewise  { ( • )  :  V6}  for  the  hidden  units;  0  = 
{Oia}  U  { Ajfo}  U  {W%}  denotes  the  “weights”  of  the  corresponding  potentials  or  potential  pairs; 
and  Z  stands  for  the  partition  function,  which  is  a  function  of  0. 

The  bipartite  topology  of  the  harmonium  graph  suggests  that  nodes  within  the  same  layer 
are  conditionally  independent  given  all  nodes  of  the  opposite  layer.  Specifically,  from  Eq.  5.1, 
we  have  the  following  factored  form  for  the  between-layer  conditional  distribution  functions: 
p(x|h)  =  J^[.p(xj|h),  p(h|x)  =  Yl  jP(hj  |x),  and  each  of  the  singleton  conditional  has  a  simple 
exponential  family  form: 


p(xi\h)  =  exp  {  Y  6iafia{xi )  -  Ai({6ia})}, 

CL 

(5.2) 

p(hj |x)  =  exP  {  Y  ^jbgjb(hj)  -  Bj({Xjb})}, 

(5.3) 

b 


where  A,  ( ■ )  and  Bj(-)  denote  the  respective  log-partition  functions;  and  the  “shifted”  parameters 
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Symbol 

Definition 

X 

observed  units  (random  variables)  in  the  harmonium  model 

X, 

the  / 1 h  observed  unit 

X 

a  vector  of  observed  values  as  an  instantiation  of  X 

Xi 

the  /th  entry  of  x 

I 

the  total  number  of  observed  units 

H 

hidden  units  (random  variables)  in  the  harmonium  model 

Hj 

the  j  th  hidden  unit  representing  topics 

h 

a  vector  of  hidden  values  as  an  instantiation  of  H 

hj 

the  y  th  entry  of  h 

J 

the  total  number  of  hidden  units 

fia 

the  ath  potential  function  associated  with  the  /th  observed  unit 

Oia 

the  parameter  associated  with  fia 

e 

the  vector  of  such  parameters  when  each  observed  unit  has  a  single  potential  function 

9jb 

the  6th  potential  function  associated  with  the  jth  hidden  unit 

Xjb 

the  parameter  associated  with  gjb 

X 

the  vector  of  such  parameters  when  each  hidden  unit  has  a  single  potential  function 

wjb 

r  r  ia 

the  a,  6th  potential  function  associated  with  the  /th  observed  unit  and  /th  hidden  unit 

w 

the  matrix  of  such  parameters  when  each  unit  has  only  one  potential  function 

columns  of  W  follows  iid  multivariate  Gaussian  of  dim-/;  this  is  the  mean  vector 

e2 

the  vector  of  non-zeros  elements  in  the  diagonal  covariance  matrix 

Hi, 

the  mean  and  variance  parameter  for  Wn, . . . ,  li/y 

z 

the  intractable  partition  function 

e2 

the  discretization  size  of  the  Fangevin  algorithm  in  Section  5.3.2 

N 

the  size  of  the  data  set,  or  the  number  of  independent  observations  available 

m 

#  samples  obtained  by  repeatedly  doing  brief  sampling  as  described  in  Sectoin  5.3.3 

V 

a  I  x  I  matrix  equivalent  to  WWT 

Table  5.2:  Symbol  table 


6ia  and  Xjb  are  defined  as: 

L  =  eia  +  Y,w?a9AhJ), 

jb 

A  =  Ajt  +  53  w£Uxt), 

ia 


where  the  shifts,  i.e.,  the  second  term  in  each  of  the  equation  above,  are  induced  by  the  total 
couplings  between  units  in  the  input  and  hidden  layers.  As  seen  from  the  above  definition,  since 
all  the  parameters  in  the  joint  distribution  under  EFH  can  be  identified  from  the  local  conditional 
distributions,  one  can  determine  an  EFH  using  a  bottom-up  strategy  by  directly  specifying  the 
often  easily  comprehensible  local  conditionals.  For  instance,  as  our  running  example  in  this 
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Figure  5.1:  The  graphical  model  representation  for  a  harmonium  with  3  input  units  (e.g.,  binary 
word  occurrences  in  a  document)  and  2  hidden  units  (e.g.,  projection  in  a  2-dim  topic  space). 

chapter,  we  define  the  following  Gaussian-Bemoulli  EFH  (GB-EFH)  for  text: 

p(xi\\i)  =  Bernoulli (xj | logit (Oj  +  hjWjj)),  (5.4) 

j 

p(hj\x)  =  1),  (5.5) 


where  logit(a)  =  (1  +  e~a)  1  is  the  logistic  function,  and  the  shift  of  the  logit-transformed 
Bernoulli  rate  is  induced  by  a  weighted  combination  of  the  latent  units  h.  It  can  be  shown  that 
under  this  construction,  we  obtain  an  EFH  with  the  joint: 

X 

p(x,  h)  oc  exp  |0Tx  —  -hTh  +  xTWhj.  (5.6) 

The  GB-EFH  models  text  (represented  by  variables  x)  as  binary  occurrences  of  words,  which  is 
suitable  for  sparse,  short  text  such  as  video  captions.  When  modeling  long  articles,  one  may  want 
to  directly  model  word  counts;  and  in  this  case  one  can  replace  Eq.  5.4  with,  e.g.,  a  Binomial  dis¬ 
tribution.  It  is  interesting  to  examine  side-by-side  the  GB-EFH  and  the  LDA  model  as  displayed 
in  Fig.  5.2.  Note  that,  when  treating  each  hidden  unit  hj  as  a  representative  of  a  latent  topic  as¬ 
pect,  Eq.  5.4  can  be  understood  as  a  likelihood  function  of  an  observed  attribute,  such  as  a  word 
occurrence,  induced  by  a  combination  of  topics.  Thus,  the  coupling  matrix  W  =  {Wi, . . . ,  Wj} 
in  GB-EFH  is  reminiscent  of  the  word  probability  matrix  B  =  {/3i, . . . ,  /3j}  in  LDA,  where  /3j 
denotes  the  M -dimensional  vector  (M  denotes  the  size  of  the  vocabulary)  of  multinomial  word 
probabilities  under  topic  j.  In  GB-EFH  each  M-vector  Wj  represents  the  set  of  “contributions" 
topic  j  as  on  each  word  in  a  vocabulary.  Although  structurally  similar,  it  is  noteworthy  that  the 
topic  mixing  mechanism  of  GB-EFH  is  very  different  from  that  of  the  LDA  model.  In  LDA 
the  topic  mixing  is  achieved  by  marginalizing  out  the  auxiliary  topic  indicator  variables  for  each 
word  occurrence  -  as  illustrated  in  Fig.  5.2(b),  the  LDA  likelihood  of  a  word  xw,  given  topic 
mixing  coefficient  6  and  the  probabilities  of  this  word  under  all  J  topics,  {/3w,i,  ■  ■  ■ ,  /3w,j}  can  be 
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(a) 


(b) 


(c) 


Figure  5.2:  A  comparison  of  EFH,  LDA  and  BEFH  models  over  a  single  document.  Circles 
represent  variables,  and  diamond  represents  model  parameters,  (a)  EFH.  For  easy  comparison, 
the  hidden  unit  (i.e.,  the  topic  weight  coefficients)  { H3 }  and  the  input  units  { X, }  are  represented 
as  vector  valued  variables  H  and  A",  respectively.  For  simplicity,  only  the  W  parameter  of  EFH 
is  explicitly  shown,  (b)  LDA.  Note  the  correspondence  between  7r  in  LDA  and  H  in  EFH,  and  the 
fact  that  j3 j' s  are  random  variables  rather  than  parameters.  /  denotes  the  length  of  the  document, 
(c)  BEFH.  Note  that  W  =  {  W3 }  are  now  lifted  as  random  variables. 

written  as  p(xw\0 )  =  ’}2zp(z\6)p(xw\'B,  z)  =  JA  QjPwj  ~  whereas  it  can  be  shown  that  in  EFH 
the  expected  rates  of  all  words  are  directly  determined  by  the  weighted  sum  of  topic  specific  con¬ 
tributions  Y2j  hjWj  =  Wh.  In  this  regard  EFH  is  closer  to  the  classical  LSI  principle  in  which 
the  observed  rates  of  all  words  can  be  expressed  as  a  weighted  combinations  of  the  eigen-topics 
(i.e.,  orthonormal  topic-specific  word  rate  vectors). 

Empirically,  it  was  noted  that  the  performance  of  EFH  and  variants  on  latent  semantic  mod¬ 
eling  is  comparable,  and  sometimes  superior,  to  LDA  [113,  114].  But  as  shown  in  Fig.  5.2, 
structurally  EFH  is  not  yet  a  full  undirected  counterpart  of  LDA,  which  employs  an  elegant 
hierarchical  strategy  to  incorporate  priors  for  both  the  word  probabilities  B  and  topic  mixing 
coefficients  n.  We  expect  that,  as  is  the  case  for  LDA,  it  is  possible  for  EFH  to  also  leverage  on 
the  possible  extra  modeling  power  endowed  by  a  Bayesian  formalism. 

Now  we  propose  a  Bayesian  EFH  that  exploits  the  proclaimed  benefits.  To  maintain  ex- 
changability  between  hidden  units  {hj},  we  place  column-iid  prior  on  W,  that  is,  each  column 
of  W  follows  a  multivariate  Gaussian,  which  is  a  common  choice  for  modeling  continuous  pa¬ 
rameters  without  any  additional  assumption: 


j 


j 


p(w> = n  pw) = n  a/w>,e). 


(5.7) 


A  full  covariance  matrix  in  the  above  prior  would  have  size  M2,  which  is  prohibitively  expensive 
for  modeling  large  vocabulary.  For  simplicity,  we  consider  a  further  simplification  where:  £  = 
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diag(cr),  i.e.  T,,:)  =  This  means  that  each  element  of  W  follows  an  independent 

normal  distribution.  Note  that  although  we  omit  correlations  between  the  topic-word  coupling 
coefficients,  the  expressiveness  of  this  prior  is  comparable  to  the  Dirichlet  prior  for  columns 
in  the  B  matrix  of  FDA,  which  captures  little  correlation  behavior  of  the  word-probabilities 
sampled  from  a  simplex. 

Now  we  are  left  with  two  remaining  sets  of  parameters  of  EFH:  6  and  A.  It  turns  out  that  in 
many  practical  settings  (e.g.,  GB-EFH  and  DWH),  A  is  vacuous,  i.e.,  A  =  0,  which  essentially 
“centers"  the  conditional  distribution  p(h|x)  at  the  shifts  induced  by  the  input  units.  For  6, 
in  EFH  it  lacks  an  intuitive  semantics,  such  as  being  a  prior  for  topic  coefficients  as  in  FDA. 
Therefore  we  choose  to  leave  6  as  fixed  parameters  to  be  estimated  via  a  maximum-likelihood 
principle  or  cross-validation  techniques. 

Now,  putting  things  together,  we  arrive  at  a  Bayesian  EFH  model  with  the  following  joint 
density  function 

p(x.  h.  W| 0,  n,  er)  =  p( Wj/Lt,  cr)p(x,  h|0,  W).  (5.8) 

The  hyperparameters  in  the  model  are  //  and  a,  which  we  treat  as  fixed  quantities  presumably 
known  or  to  be  estimated. 


5.3  Model  Inference  and  Estimation 


5.3.1  Algorithm  Overview 

Given  the  prior  distribution  on  W  with  presumably  known  hyperparameters  and  a  collection 
of  N  riri-sampled  data  X  =  (xi, . . . ,  xjv),  also  assume  that  parameters  9  are  known  or  already 
estimated,  we  need  to  compute  or  approximate  the  posterior 

P(W|X)  cc  p(X|W)p(W)  =  — p(X|W)p(W)  (5.9) 

and  the  predictive  posterior  density  over  hidden  variables 


p(h|x,X) 


p(h|x,  W)p(W|X)dW, 


(5.10) 


where  p(-)  in  Eq.  5.9  represents  the  unnormalized  density  function  corresponding  to  p(-). 

We  can  take  a  Monte  Carlo  approach  to  obtain  a  set  of  m  samples  (Wj, . . . ,  Wm}  by  sim¬ 
ulating  an  ergodic  Markov  chain  whose  stationary  distribution  is  the  posterior  p(W|X).  The 
difficulty  here  is  due  to  the  presence  of  an  intractable  term  (1/X(  W)  );V  in  the  posterior  distribu¬ 
tion,  which  is  a  function  of  the  target  random  parameters  W.  Therefore,  unlike  simple  posterior 
inference  settings  in  which  there  is  a  normalization  constant  that  will  be  canceled  out  by  comput¬ 
ing  the  ratio  of  two  posterior  densities  or  taking  the  derivative,  in  Bayesian  inference  with  MRFs 
using  MCMC  we  have  to  seek  an  efficient  approximation  of  the  intractable  random  partition 
function  in  posterior  distribution. 
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In  the  following,  we  investigate  two  MCMC  approximation  schemes  and  show  that  in  both 
cases  the  intractable  term  can  be  written  as  expectations  under  the  data  distribution  p(x|W). 
Then  we  show  that  these  terms  can  be  approximated  efficiently  by  minimizing  the  contrastive 
divergence  (CD)  [55],  or  equivalently,  by  performing  Gibbs  sampling  for  only  very  few  steps 
starting  from  data  (the  empirical  distribution).  The  derivation  is  in  parallel  with  that  in  [81];  here 
we  provide  a  more  detailed  discussion  on  the  comparison  of  the  two  schemes. 


5.3.2  Approximation  schemes 

Metropolis-Hasting  algorithms 

Consider  simulating  a  Markov  chain  using  a  Metropolis-Hasting  algorithm  with  the  proposal 
distribution  g(W'|W).  The  acceptance  probability  of  the  transition  W  — >  W'  is 


P(W,W') 


A>(W'|X)  g(W|W')  \ 
Vp(w|x)  q{w'\wy  ) 


(5.11) 


Suppose  the  proposal  distribution  is  easy  to  draw  sample  from  and  is  tractable,  then  the  only 
difficulty  in  implementing  Metropolis-Hasting  algorithms  is  to  approximate  the  intractable  term 

(  z{ w' ) )  ’  where  N  is  the  size  of  the  data  set.  The  ratio  of  two  partition  functions  can  be  written 
as  an  expectation  over  the  data  distribution  p(x|  W'). 


Z(W) 
Z(  W') 


T,  peTx+ixTw'w'Tx 
— W'W'T)x  e  2 


E]xt(wWt 

Z(W') 

exp  jixT  (WWT  -  W'W,T)  x  j 


p(x|W') 


(5.12) 


The  Langevin  algorithm 

We  also  investigate  the  Langevin  algorithm  as  an  alternative  approximate  MCMC  scheme.  The 
Markov  chain  simulated  by  the  Langevin  algorithm  is  characterized  by  the  following  stochastic 
transition  equation 

e2 

W'  =  W  +  —  Vlogp(W|X)  +  eNw  (5.13) 

where  Nw  are  randomly  generated  from  J\f( 0,  I\w\)-  This  is  a  discrete  version  of  the  Langevin 
diffusion  and  e2  corresponds  to  the  discretization  size.  A  diffusion  is  a  continuous  time  process 
which  can  be  defined  by  a  stochastic  differential  equation.  The  Langevin  diffusion  is  character¬ 
ized  by 

dW(t)  =  ^Vlogp(W(t)|X)dt  +  dB(t)  (5.14) 

where  B(t)  is  a  |W|-dimensional  Brownian  motion.  The  Markov  chain  converges  when  e  is 
reasonably  small  and  has  the  desired  density  p(W|X)  as  e2  — *  0.  The  gradient  of  the  posterior  is 
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the  sum  of  three  terms 

V  logp(W|X)  =  Vlogp(X,W)  =  V  logp(W)  +  V  logp(X|W)  +  (— iVV  log  Z(W))  (5.15) 
where  in  the  GB-EFH  model 


{Vlogp(W)}.  4  al°gP<W>  =  a  ) 

'  g'  -1"  dWij  3W„  <i? 1  “  '' 


y 


and  V  logp(X|  W)  is  also  tractable 

V  logp(X|  W)  =  y  v  logp(xj|W)  =  x*x7  W  =  XXTW. 


(5.16) 


(5.17) 


Hence,  the  only  intractable  term  involved  in  the  Langevin  algorithm  is  iVV  log  Z(W),  in  which 
V  log  Z (W)  can  be  written  as  an  expectation  over  the  data  distribution  p(x|  W) 

V  log  Z( W)  =  £  Vp(x|W)  =  X  V  logP(x|W) 

=  V  p(x|  W)  V  logp(x|  W)  =  (xxTw\  (5.18) 

J  \  /  p(x|W) 


Discussion  on  the  two  schemes 

The  straightforward  approach  of  estimating  Z(W)  itself  often  fails  to  provide  reliable  estimates. 
To  provide  some  intuition  of  the  nature  of  this  difficulty,  we  give  a  brief  illustration  as  follows: 
with  some  mathematical  manipulation,  the  partition  function  in  the  BG-EFH  model  equals  the 
expectation  of  the  following  random  variable 

Z(t)  =  l[(l  +  U)  (5.19) 

i 

under  the  multivariate  lognormal  distribution  of  t 

t  ~  LogNormal(0.  WW2 ) 

Thus  under  the  Bayesian  framework  in  which  W  is  considered  a  random  matrix,  we  should 
expect  Z( W)  to  have  exponential  mean  and  variance. 

Thus,  we  put  more  emphasis  on  variance  in  the  bias-variance  tradeoff  of  estimators  in  ap¬ 
proximate  Bayesian  learning.  Compare  the  approximations  in  the  Langevin  algorithm  to  update 
W  as  in  Eq.  5.13  and  in  Metropolis-Hasting  algorithms  to  compute  the  acceptance  probability 
as  in  Eq.  5.11 

e2V  logp(W|X)  =  —  e2iVV  log  Z(  W)  +  C  (5.20) 

(  ^(W)  ^  _  N  (W-W')iVVloKZ(W')  rc 

[z{W))  ~e  e  (5'21) 

where  C  =  e2(V  logp(W)  +  V  logp(X|  W))  can  be  computed  exactly,  and  Eq.  5.21  is  obtained 
by  first-order  Taylor  expansion.  We  expect  the  latter  approximation  has  exponential  variance 
compared  to  the  former  one.  Therefore,  we  choose  the  Langevin  algorithm  conjoint  with  the 
MCMC  scheme  for  posterior  inference  on  BEFH  model. 


57 


5.3.3  Approximating  the  expectations  with  brief  sampling 

V  log  Z(W)  in  Eq.  5.18  can  be  estimated  using  a  “sampling  very  few  steps  from  the  data"  tech¬ 
nique.  It  is  first  proposed  in  [55]  under  the  name  of  minimizing  contrastive  divergence  (CD) 
and  suggested  by  [81]  for  approximate  Bayesian  inference  in  MRF  in  which  it  is  named  brief 
sampling.  Brief  sampling  in  GB-EFH  runs  multiple  chains  starting  from  the  data  X.  Each 
chain  performs  l  full  step  of  Gibbs  sampling.  A  total  of  N  samples  are  obtained,  denoted  by 
Xi  =  (x(j°,...,x$).  Then  V  log  X (W)  is  approximated  as  an  expectation  over  the  empirical 
distribution  of  X This  whole  procedure  of  brief  sampling  is  illustrated  as  follows  where  we  set 
l  =  1  to  perform  just  a  single  iteration: 

•  Draw  hj^  ~  A/’(W1xfc,  If)  for  k  =  1, . . . ,  N; 

•  Draw  xj.,1 }  ~  Bernoulli  (logit((0  +  Wh®)))  for  k  —  1, . . . ,  N\ 

•  VlogZ(W)  «  I£,  4«(4‘>)TW  =  &x,xfw 

Brief  sampling  has  been  previously  shown  to  provide  low  variance  estimation  with  a  small 
bias  in  ML  learning  [22].  The  intractable  term  in  ML  learning  of  MRF  is  just  the  same  term 

V  log  Z (W),  therefore  we  expect  similar  low-variance  behavior  of  brief  sampling  estimation  in 
the  Langevin  algorithm.  Fig.  5.4  in  the  experiment  section  provides  an  empirical  demonstration. 


5.3.4  Computing  the  predictive  posterior  density 

Given  m  samples  (Wi, . . . ,  W,„}  obtained  by  the  Langevin  algorithm  with  brief  sampling  de¬ 
scribed  above,  the  predictive  conditional  distribution  is  approximated  by 

-  m 

p(h|x,X)  =  —  VVh|x,  Wfc).  (5.22) 

m 

k= 1 

More  specifically,  in  GB-EFH  we  are  interested  in  the  conditional  expectation  of  h  given  x,  this 
is  computed  as 

1  m  1  m 

E  (h|x,  X)  =  -  V  E  (h|x,  Wk)  =  -  V  Wjx.  (5.23) 

m  m  ■ ' 


5.3.5  Hyperparameter  Estimation 

Now  we  briefly  outline  how  to  compute  the  maximum  likelihood  estimates  of  the  hyperparame¬ 
ters  p  and  a  of  BEFH  from  training  data,  based  on  an  empirical  Bayes  principle.  We  employ  a 
Monte  Carlo  EM  scheme.  In  the  “E"-step,  we  impute  the  hidden  variables  in  BEFH,  specifically, 
W,  from  its  posterior  distribution;  and  in  Section  5.3.2  we  have  developed  the  Langevin  algo¬ 
rithm  for  this  step.  Given  a  set  of  K  imputed  W  from  iteration  t,  we  proceed  to  the  “M"-step,  in 
which  now  we  are  essentially  back  to  the  standard  ML  learning  scenario  for  fully  observed  MRF, 
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and  compute  an  estimate  of  the  hyperparameters  as  follows: 

K 

(p.(t+1),  cr(t+1))  =  argmax  V  log p(X,  WS°| /i,  cr) 

il.O  ' 

k= 1 
K 

=  argmax  (logp(WS°|/i,  a)  +  logpfXlW^)) 

k= 1 
K 

—  arg max  log^Wj^l/x,  <r),  (5.24) 

fi.cr  ^  ^ 

k=  1 

where  W[*'  denotes  the  A  -th  imputed  sample  at  iteration  /  . 

It  can  be  shown  that,  the  ML  estimate  of  each  element  of  n  and  a  is: 

<5-25> 

j  k 

(5.26) 

where  denotes  the  ij-th  element  of  W)'1. 

To  initialize  the  EM  procedure,  we  can  make  use  of  the  ML  estimate  of  W,  denoted  by 
WMLE ,  and  let 


/4">  = 

j 

(5.27) 

=  •jJyE>rE  -  4o>)2 

(5.28) 

This  concludes  the  algorithmic  section. 

5.4  Experimental  Evaluation 

5.4.1  Synthetic  EFH  parameter  estimation 

The  dataset  is  generated  for  a  GB-EFH  model  with  6  =  0.  The  model  contains  /  =  100  observed 
variables  and  J  =  10  hidden  variables,  so  the  number  of  parameters  in  W  is  /  x  J  =  1000. 
We  vary  the  size  of  the  training  dataset  from  25  to  200  and  compare  the  performance  of  ML 
estimation  via  gradient  ascent  and  the  Langevin  algorithm  proposed  in  the  previous  section. 

Generating  iid  samples  from  a  general  MRF  is  known  to  be  non-trivial.  However,  for  a  GB- 
EFH  model  exact  samples  can  be  generated  fairly  efficiently  by  employing  the  perfect  sampling 
technique  [28]  when  all  the  elements  of  the  matrix  V  =  WWT  are  non-negative.  To  ensure  this 
property,  we  first  generate  an  M  x  M  matrix  whose  elements  are  uniformly  distributed  in  the 
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0.4 


Figure  5.3:  Details  of  Monte  Carlo  simulations  of  the  Langevin  algorithm,  with  y- axis  corre¬ 
sponds  to  the  value  of  Wn-  Three  chains  of  different  starting  points  are  shown.  The  bum-in  time 
to  reach  convergence  is  approximately  50  transition. 


[0,  0.1]  interval.  Then  W  is  determined  by  performing  an  SVD  on  this  matrix  so  that  V  is  the 
best  rank-  J  approximation. 

There  is  an  un-identifiability  issue  here  because  the  data  distribution 


p(x|W)  =  —  exp 
Zj 


QxtWWtx^) 


(5.29) 


is  a  function  of  V  and  is  invariant  if  W  is  right-multiplied  by  an  orthogonal  matrix  Q  because 
(W Q)(W Q)^  =  WWT.  Also  it  can  be  shown  the  prior  of  W  defined  in  Eq.  5.7  is  also  invariant 
under  this  transformation.  Therefore  our  evaluation  criteria  are  based  on  the  matrix  V  instead  of 
W.  We  define  two  error  measures:  mean  averaged  error  ( mae )  and  mean  relative  error  (mre)  to 
evaluate  an  estimate  V 


mae 


mre 


^'1 

*  3 

1  I Vjj  ~  Vy\ 

M2  i  j  max{\vij\A%\} 


(5.30) 

(5.31) 


Fig.  5.4  shows  the  estimate  of  the  gradient  using  brief  sampling  versus  the  number  of  sam¬ 
pling  steps  l.  We  also  generate  the  same  number  of  samples  using  the  perfect  sampling  technique 
to  provide  an  approximately  correct  version  for  comparison.  Brief  sampling  provides  biased  es¬ 
timation  compared  to  the  exact  sampling  approach,  but  the  bias  is  relatively  small  considering 
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Figure  5.4:  The  estimation  versus  the  number  of  sampling  steps  in  brief  sampling  (solid  line) 
compared  with  the  estimation  perfect  sampling  (dash  line),  with  y- axis  corresponds  to  an  esti¬ 
mated  derivative  of  log-partition  function  <9  log  Z(W)/<9Wn  averaged  over  50  runs.  Both  sam¬ 
pling  schemes  generate  100  samples  in  each  run.  The  standard  error  bars  are  scaled  by  1.64, 
indicating  95%  significance  of  the  difference  in  estimation.  A  single  sampling  step  suffices  as  it 
maximizes  the  program  efficiency  without  increasing  bias  or  variance  of  the  estimation. 


the  difficulty  of  dealing  with  intractable  partition  function.  Note  that  the  bias  is  not  decreased 
by  increasing  l.  The  variance  of  the  estimation,  on  the  other  hand,  is  minimized  when  1  =  1. 
Therefore,  we  let  l  =  1  in  subsequent  experiments. 

Two  tunable  parameters  in  the  Langevin  algorithm  are  yet  to  be  determined:  the  step  size 
e  in  Eq.  5.13  and  the  number  of  steps  l  to  sample  from  data  in  brief  sampling.  We  choose  an 
appropriate  e  by  investigating  the  evolution  of  a  number  of  elements  of  W  during  the  simulation 
of  the  Markov  chain.  Under  a  too  large  step  size  the  chain  goes  to  infinity  in  a  few  steps,  and 
under  a  too  small  one  the  burn-in  time  is  undesirably  long.  Fig  5.3  shows  a  simulation  of  the 
Langevin  algorithm  using  the  step  size  we  choose. 

In  Fig.  5.5  we  compare  the  performance  of  ML  estimation  via  gradient  ascent  and  the  Bayesian 
approach  using  the  Langevin  algorithm.  The  Langevin  algorithm  consistently  achieves  lower  er¬ 
rors  under  both  measures  and  with  different  sizes  of  the  training  set.  As  more  data  are  available, 
the  performance  of  ML  estimation  improves  little;  it  appears  that  the  gradient  ascent  procedure 
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Figure  5.5:  The  Performance  of  ML  learning  and  Bayesian  inference  using  the  brief  Langevin 
algorithm  under  two  different  error  measures  (a)  mean  absolute  error;  (b)  mean  relative  error. 
The  results  are  averaged  over  10  runs;  the  error  bar  may  be  too  small  to  be  distinguishable  from 
the  figure.  The  Bayesian  approach  is  subject  to  less  error  rate  than  its  ML  alternative  in  both 
measures. 


gets  stuck  into  a  local  minimum.  On  the  other  hand,  the  Langevin  algorithm  does  benefit  from 
more  data,  which  is  possibly  the  consequence  of  the  uninformative  prior  we  placed  for  this  prob¬ 
lem  by  setting  =  0,  cr*  =  cZ  =  0.1  for  i  =  1, . . . ,  M.  The  estimation  by  both  methods  has  a 
non-negligible  bias  from  the  true  value,  and  we  conjecture  that  it  is  due  to  the  sparsity  of  the  data. 
We  also  observe  that  the  performance  difference  of  ML  estimation  and  the  Langevin  algorithm 
is  much  larger  as  measured  by  mean  absolute  error  than  mean  relative  error ,  which  suggests  that 
the  latter  algorithm  provides  better  estimates  for  parameters  with  larger  values. 


5.4.2  Classification  of  Text  and  Image  Data 

The  data  set  is  from  the  compiled  TRECVID’03  news  video  collection  in  [1 14].  It  contains  1078 
video  shots  with  captions,  each  of  which  can  be  treated  as  a  document  and  belongs  to  one  of 
five  pre-defined  categories.  1894  binary  word  occurrence  features  and  166  continuous  features 
for  key  images  are  extracted  from  each  document.  We  extend  the  dual-wing  harmonium  (DWH) 
developed  in  [1 14],  which  was  previously  trained  by  ML  estimation,  to  Bayesian  DWH  (BDWH) 
in  which  column-zz'J  multivariate  normal  priors  are  placed  on  the  coupling  matrices  for  word  and 
image  features  respectively.  The  hyperparameters  in  the  priors  are  estimated  using  the  empirical 
Bayes  method  developed  in  Section  5.3.5. 

To  give  a  hint  on  the  difficulty  of  performing  Bayesian  learning  in  a  real  dataset  discussed 
in  Section  5.3.2,  we  implement  the  naive  Monte  Carlo  estimation  of  the  partition  function  in 
Eq.  5.19  for  both  GB-EFH  with  synthetic  dataset  and  DWH  with  real  world  dataset.  The  his¬ 
tograms  of  the  estimated  Z  over  100  runs  are  shown  in  Fig.  5.6.  In  the  synthetic  dataset  the 
estimated  values  approximately  fit  to  a  normal  distribution.  However,  in  the  real  dataset,  there 
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Figure  5.6:  Histogram  of  100  estimations  of  partition  function  using  a  naive  Monte  Carlo  ap¬ 
proximation  on  (a)  synthetic  dataset;  (b)  real-world  dataset.  Arrows  are  centered  at  the  mean 
and  indicate  an  interval  of  length  of  2  times  the  standard  deviation.  Each  estimation  computes 
the  expectation  using  1000  samples.  On  the  real-world  data  set,  the  estimation  is  subject  to 
unacceptably  high  variance. 


(a)  Synthetic  Dataset 


Figure  5.7:  Classification  accuracy  versus  number  of  latent  topics.  Bayesian  DWH  yields  com¬ 
parable  performance  to  the  original  DWH  approach.  Both  produce  better  results  over  the  baseline 
LSI  approach  and  the  GM-LDA  approach  backed  by  a  directed  graphical  model. 


are  a  few  spurious  outliers,  which  shift  the  mean  estimated  values  over  all  the  runs  significantly, 
leading  to  generally  biased,  high  variance  estimates.  In  Fig.  5.6(b)  the  variance  of  the  estimation 
is  three  times  as  large  as  the  estimated  mean. 
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We  evaluate  the  performance  of  four  different  models  LSI  [34],  GM-LDA,  DWH  and  BDWH 
for  classification  task  on  the  news  video  collection.  For  each  algorithm,  the  parameters  are 
estimated  using  all  data,  without  reference  to  their  labels.  Once  the  model  are  learned,  every 
document  in  the  data  are  projected  into  the  lower-dimensional  latent  semantic  space.  The  data 
are  then  randomly  split  to  a  training  set  and  a  testing  set  with  the  same  size.  We  show  the  result 
of  using  one  nearest  neighbor  (1-NN)  classifier  to  predict  the  category  of  each  test  data  given  the 
training  data. 

Fig.  5.7  compares  the  performance  obtained  at  different  dimensions  of  latent  semantic  space, 
or  equivalently  different  numbers  of  latent  topics  ranging  from  4  to  32.  BDWH  and  DWH  achieve 
comparable  classification  accuracy  consistently,  and  outperform  LSI  and  GM-LDA  with  a  good 
margin  when  the  number  of  latent  topics  are  16  and  32.  LSI,  DWH  and  BDWH  all  get  better  per¬ 
formances  in  higher  dimensional  semantic  space  with  less  dimensionality-reduction.  In  contrast, 
GM-LDA  outperforms  other  methods  when  the  number  of  latent  topics  are  4  but  the  performance 
curve  goes  down  when  the  number  of  latent  topics  increases  from  16  to  32,  which  may  reflect  a 
low-dimensionality  bias  from  the  modeling. 


5.5  Conclusion 

We  have  proposed  a  new  Bayesian  formalism  of  EFH  model  and  variants  for  latent  semantic 
modeling  of  text  and  multimedia  data.  The  Langevin  algorithm  conjoint  with  an  MCMC  scheme 
was  applied  to  carry  out  approximate  posterior  inference,  and  an  empirical  Bayes  method  is  also 
developed  for  estimating  the  parameters.  The  Bayesian  approach  achieves  superior  performance 
of  parameter  estimation  on  a  synthetic  data  set  and  comparable  classification  accuracy  on  a  real 
dataset  of  both  text  and  image  data. 

EFH  models  differ  from  singular  value  decomposition  in  that  there  is  a  freedom  to  choose 
different  exponential  family  distribution  to  model  the  input  data,  which  could  be  either  discrete 
or  continuous.  Compared  with  a  linear  ICA  model,  hidden  topics  in  EFH  are  not  assumed  to  be 
statistically  independent.  Instead,  they  are  conditionally  independent  given  the  observed  layer. 
This  reflects  the  assumption  that  topics  could  be  semantically  different  but  correlated,  such  as 
“science”  and  ’’technology”.  Similar  assumptions  could  also  be  spotted  in  other  studies  [5,  17]. 

Our  experiments  presented  in  this  chapter  focus  on  binary  occurrences  of  words  which  is 
suitable  for  short  texts.  A  good  future  direction  is  to  build  an  BEFH  to  directly  model  word 
counts.  Also,  the  independent  Gaussian  prior  we  used  can  be  replaced  by  an  more  informative 
one,  while  the  inference  and  learning  algorithm  can  straightforwardly  apply  to  the  new  formal¬ 
ism.  Finally,  the  discretization  scheme  in  the  Langevin  algorithm  can  be  more  elaborate,  such  as 
incorporating  the  idea  suggested  in  [96]. 
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Chapter  6 

Click  Models:  Leveraging  User  Feedback 
for  Better  Search  Experiences 


In  the  era  of  World  Wide  Web,  search  engines  have  become  essential  tools  for  browsing  and 
navigating  over  vast  amounts  of  information  on  the  internet.  After  a  query  submission,  the  user 
interacts  with  the  search  engine  via  examining  through  the  snippets  from  web  documents  in 
the  ranked  list  of  query  results  and  following  the  hyper-link  with  a  click  if  she  would  like  to 
find  more  about  a  particular  one.  Such  events  are  usually  logged  to  track  user  activities  and 
provide  insights  about  the  performance  of  the  search  engine.  It  is  probably  the  most  extensive, 
albeit  indirect,  surveys  on  user  experience,  especially  in  the  event  that  explicit  user  feedback  is 
either  expensive  or  less  likely  to  be  collected,  which  is  usually  true  for  any  query  system  with  a 
moderate  or  large  user  base. 

In  this  chapter,  we  study  how  to  leverage  user  click  data  to  obtain  a  similarity  score,  known 
as  user-perceived  relevance,  for  any  query  term  and  result  document  pair.  Such  scores  form  a 
important  feature  to  adapt  future  ranking  for  improved  user  experience.  Although  much  of  the 
material  is  presented  in  the  context  of  web  search,  it  should  be  applicable  to  other  search  prob¬ 
lems  as  well,  including  the  task  of  querying  multimedia  databases.  Most  of  the  work  described 
subsequently  is  based  on  the  material  presented  in  [49,  50,  51,  72]. 


6.1  Background 

We  first  introduce  definitions  and  notations  that  will  be  used  throughout  this  chapter.  A  web 
search  user  initializes  a  query  session  by  submitting  a  query  to  the  search  engine.  We  regard 
re-submissions  and  reformulations  of  the  same  query  as  distinct  query  sessions.  Snippets  from 
web  documents  are  presented  in  a  ranked  order  in  the  first  result  page,  where  a  document  in  a 
higher  position ,  or  rank ,  appears  above  those  in  lower  positions.  Such  appearance  is  also  known 
as  impression  in  the  web  search  community.  The  identity  of  a  document  impressed  at  the  7th 
position  is  denoted  by  du  where  the  value  of  i  ranges  between  1  and  M,  the  latter  of  which  is  the 
maximum  number  of  results  displayed  in  the  page. 
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6.1.1  Click  Position-Bias 


If  a  document  is  impressed  and  clicked,  it  indicates  a  positive  feedback  from  the  user  regarding 
the  relevance  of  the  document.  On  the  other  hand,  if  there  is  only  impression  without  any  click, 
it  possibly  links  to  a  negative  feedback.  This  may  seem  a  reasonable  conclusion  at  first  sight, 
however,  empirical  studies  such  as  [62]  have  proved  that  this  straightforward  idea  for  leveraging 
click  data  is  subject  to  heavy  bias  favoring  documents  that  appear  at  higher  positions  to  those 
lower  ones,  regardless  of  their  snippets  or  contents.  This  could  be  (partially)  attributed  to  the 
fact  that  users  are  accustomed  to  examine  over  search  results  in  a  roughly  linear  order  from  the 
top  to  the  bottom,  with  the  possibility  of  an  early  termination,  reflecting  varying  degrees  of  trust 
on  the  ranking  algorithm  of  the  search  engine  tailored  to  their  preferences.  Since  a  typical  user 
may  not  go  over  every  document  in  the  page,  for  a  document  that  appears  at  the  bottom,  even  if 
it  mostly  satisfied  the  user’s  information  needs,  it  may  still  receive  much  less  user  attention  and 
clicks  than  the  top  ranked  one. 

This  position-bias  in  observing  clicks  over  different  positions  can  be  portrayed  by  examining 
Figure  6.1  obtained  from  an  eye-tracking  study  in  [62].  Both  plots  display  how  the  eye  fixation, 
a  measure  of  user  attention,  and  the  number  of  clicks  vary  over  the  top- 10  search  results.  The 
difference  between  the  two  is  in  the  ranking  of  search  results  -  the  top  plot  assumes  the  default 
ranking,  whereas  the  bottom  plot  corresponds  to  a  fully  reverse  one,  switching  the  1st  position 
with  the  10th,  the  2nd  with  the  9th,  and  so  on.  If  there  were  no  position-bias  at  all,  we  would 
expect  the  count  of  clicks  would  be  mirrored  in  the  bottom  plot  as  a  result  of  such  switching. 
However,  the  top  position  in  the  bottom  figure  still  receives  most  user  attention,  and  much  more 
clicks  than  bottom  ones,  which  implies  that  position  should  not  be  excluded  when  assessing 
chances  of  clicks  over  search  results. 

Position-bias  has  become  a  key  challenge  in  the  accurate  interpretation  of  user  clicks  and 
inspires  the  proposal  of  a  number  of  hypotheses  to  provide  formal  description,  followed  by  the 
development  of  click  models  to  offer  more  principled  solutions. 


6.1.2  Basic  Hypotheses 

The  examination  hypothesis,  proposed  first  in  [94],  characterizes  user  interaction  with  two  types 
of  probabilistic  events:  examination  and  click  over  a  document.  The  insight  is  to  impose  exami¬ 
nation  as  a  prerequisite  for  click  over  the  same  document,  and  link  the  relevance  of  the  document 
to  the  conditional  probability  of  a  click  given  that  it  has  been  already  examined.  Formally,  for 
a  given  query  session,  we  use  binary  random  variables  Et  and  Ct  to  represent  the  examination 
and  click  events  of  the  document  at  position  i,  respectively,  and  denote  corresponding  examina¬ 
tion  and  click  probabilities  by  PiE,  =  1)  and  P(Ci  =  1).  The  examination  hypothesis  can  be 
summarized  as  follows: 

P(Ci  =  l\Ei  =  0)  -  0, 

P(Ci  =  l\Ei  =  1)  =  rdi, 


66 


Rank  of  Abstract  (Normal) 

Figure  6.1:  Comparison  of  user  attention  (fixation)  and  clicks  over  top  10  ranks  between  the 
normal  order  and  the  reversed  order  reveals  position  bias.  Plots  are  extracted  from  Figure  4  in 
[62], 

where  1  <  i  <  M,  and  rr/,;  •  defined  as  the  document  relevance ,  is  the  conditional  probability 
of  click  after  examination.  Given  Et,  Ci  is  conditionally  independent  of  previous  examine/click 
events  C^-i.  This  helps  to  disentangle  click  activities  of  various  documents  as  being 

caused  by  position  and  content.  Click  models  based  on  the  examination  hypothesis  share  this 
definition  but  differ  in  the  specification  of  P(Ei). 

The  second  hypothesis,  known  as  the  cascade  hypothesis  [33],  portrays  how  user  examines 
search  results  one  by  one.  It  states  that  users  always  start  the  examination  at  the  first  docu¬ 
ment.  The  examination  is  strictly  linear  to  the  position,  and  a  document  is  examined  only  if  all 
documents  in  higher  positions  are  examined.  Formally, 

P(Ei  =  1)  =  1, 

P(Ei+1  =  l\El  =  0)=0. 

The  corollary  is  that  given  Ei,  El+ ,  is  conditionally  independent  of  all  examine/click  events 
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above  i,  but  may  depend  on  the  click  C,. 

In  the  same  manuscript,  the  cascade  model  is  proposed  by  putting  together  previous  two 
hypotheses  and  further  constrain  that 


P(Ej+i  —  1|  Ej  —  1  ,Ci)  —  1  —  Cu  (6.1) 

which  implies  that  a  user  keeps  examining  the  next  document  until  reaching  the  first  click,  after 
which  the  user  simply  stops  the  examination  and  abandons  the  query  session.  However,  it  is 
unclear  from  the  model  how  to  explain  multiple  clicks  existing  in  the  same  query  session.  A 
quick  solution  is  to  (independently)  apply  the  same  model  multiple  times,  which  does  not  have  a 
sound  probabilistic  interpretation. 

6.2  Proposed  Method 

This  section  is  devoted  to  the  design  and  implementation  of  the  dependent  click  model  (DCM), 
originally  presented  in  [51].  More  sophisticated  Bayesian  click  models  were  presented  in  [49] 
and  [72];  they  share  similar  design  goals  and  application  settings  as  DCM,  and  yield  better 
performance  when  click  data  are  less  available,  at  the  expense  of  additional  computational  time 
and  storage.  All  these  models  take  a  single  pass  over  the  click  data,  scale  linearly  in  time,  and 
need  only  constant  space  for  each  query-document  pair. 

In  the  aforementioned  cascade  model  a  user  always  leaves  the  result  page  upon  the  first  click 
and  never  comes  back.  We  propose  to  include  a  position-dependent  parameter  A,;  to  reflect  the 
chance  that  the  user  would  like  to  see  more  results  after  a  click  at  position  i.  A*’s  is  a  set  of  user 
behavior  parameters  shared  over  multiple  query  sessions.  DCM  inherits  the  assumption  that  in 
the  case  of  examination  without  a  click,  the  next  document  is  always  examined.  This  user  model 
is  shown  in  Figure  6.2. 

Examination  and  click  probabilities  in  DCM  can  be  specified  in  an  iterative  fashion  (1  <  i  < 
M): 

e<h,  i  =  1, 

Cdi,i 

^i^di,i  “1“  (&di,i  Cdi,i)i 

from  which  the  following  closed-form  equations  can  be  derived: 

i— 1 

edi,i  =  II  (X  ~rdi  +  Vdj), 

3  = 1 
i— 1 

Cdi,i  =  rdi  f[  (l  -  rdj  +  A /rd.) .  (6.4) 

3= 1 

This  completes  the  formal  specification  of  the  dependent  click  model  (DCM),  in  which  examine 
probabilities  and  click  probabilities  at  different  positions  i  become  interdependent. 


(6.2) 


(6.3) 
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Figure  6.2:  The  user  model  of  DCM,  in  which  rd.  is  the  document  relevance  of  di,  and  A,  is  the 
user  behavior  parameter  for  position  i. 


Given  the  actual  click  event  {Ci, . . . ,  Cm}  in  a  query  session  as  well  as  the  document  im¬ 
pression  {g?!,  . . . ,  dM},  the  log-likelihood  for  a  query  session  with  one  or  more  clicks  is  given 
by 

i- 1 

Z  =  (^(iog r<k  +  log  A i)  +  (1  -  Ci)  log(l  -  rd.)) 

2=1 

+  Cl  log  rdl  +  (1  -  Cl)  log(l  -  rdl) 

n 

+  log  ^1  —  Az  +  A;  Yl  (1_r<6'))> 

3=1+ 1 
l 

>  l0grd‘  +  “  Ci)  log(1  “  rdi )) 

2=1 

l-l 

+  ^2  ci  l°g  K  +  log(l  -  A/). 

2=1 

If  there  is  no  click  in  this  session,  then  the  log-likelihood  is  a  special  case  with  /  =  M,  Ci  —  Xi  — 
0. 

We  carry  out  DCM  learning  by  optimizing  values  of  rdi  and  A i  to  maximize  the  sum  of 
the  lower  bound  of  log-likelihood  in  Eq.  6.6  over  the  entire  training  set.  Document  relevance 
estimate  for  a  document  d  is  given  by: 

rd= _ #  Click  on  d _ 

#  Impression  of  d  before  last  clicked  position  l 


(6.5) 


(6.6) 
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which  is  the  empirical  conditional  click  probability  of  d  given  it  appears  higher  than  or  at  position 
/.  And  the  maximum-likelihood  estimate  for  the  user  behavior  parameter  [51] 

^  1  #  query  sessions  when  last  clicked  position  =  i 

#  query  sessions  when  position  i  is  clicked 

for  1  <  i  <  M  —  1,  which  is  the  empirical  probability  of  position  i  being  a  not-last-clicked 
position  over  all  query  sessions  in  the  training  set. 

Therefore,  we  can  set  up  two  counting  statistics  to  each  document  d,  and  parse  only  once 
through  the  training  data  to  get  all  such  counts,  and  finally  compute  all  document  relevance 
estimates.  Similarly,  we  need  additional  2 (M  —  1)  global  counts  for  Ays.  This  leads  to  an 
learning  algorithm  with  linear  time  complexity  with  respect  to  the  number  of  query  sessions 
and  linear  space  complexity  with  respect  to  the  number  of  distinct  query-document  pairs.  When 
new  data  are  available,  we  can  do  fast  update  and  re-computation  based  on  these  counts,  while 
maintaining  the  linear  scalability. 

An  important  difference  of  DCM  from  simply  computing  the  clickthrough  rate,  the  number 
of  clicks  divided  by  the  number  of  impression,  is  that  clicks  indicate  both  relevance  and  exam¬ 
ination.  So  if  a  document  is  not  clicked,  it  can  be  attributed  to  either  the  document  abstract 
is  examined  but  not  relevant  enough  to  be  clicked,  or  it  appears  lower  than  other  documents 
that  draw  the  user  attention  away.  This  explain-away  effect  is  reflected  in  Eq.  6.7  by  a  smaller 
denominator  which  only  counts  impression  before  last  clicks. 

Finally,  we  give  the  sampling  procedure  for  DCM  which  draws  examination  variables  E%  and 
click  variables  C,  one-by-one  starting  from  the  top  position,  as  applied  in  the  next  section  to  carry 
out  empirical  evaluation: 


Si  =  1; 

if£  =  o, 

Ci  =  0,  £j+ 1  =  0; 

else 

Ci  ~  Bernoulli^), 

£i+i  ~  Bernoulli]!  —  C,  +  A iCi). 


(6.9) 


6.3  Experimental  Evaluation 

We  report  our  experimental  studies  in  this  section,  which  is  based  on  over  8.8  million  queries  ses¬ 
sions  after  data  preprocessing,  sampled  from  the  click  log  of  a  major  commercial  search  engine 
in  July  2008.  We  are  comparing  the  proposed  DCM  with  the  independent  click  model  (ICM),  the 
baseline  approach  unaware  of  the  position-bias  and  the  chance  of  no  examination1,  under  which 

'The  fulltext  of  [37]  proposing  user  browsing  model,  which  appears  in  the  SIGIR  conference  in  2008,  became 
publicly  available  after  we  have  developed  the  DCM.  ICM  reflected  the  best  performed  alternative  to  the  extent  of 
our  knowledge  while  we  conducted  the  experiment. 
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Table  6.1:  Summary  of  Test  Set 


Query  Freq 

#  queries 

#  Sessions 

Avg  #  Click 

1—9 

59,442 

216,653  (5.4%) 

1.310 

10-31 

30,980 

543,537  (13.5%) 

1.239 

32-99 

13,667 

731,972(17.7%) 

1.169 

100-316 

4,465 

759,661  (18.9%) 

1.125 

317-999 

1,523 

811,331  (20.1%) 

1.093 

1000-3162 

553 

965,055  (24.0%) 

1.072 

the  probability  of  a  click  is  solely  determined  by  the  identity  of  the  document,  and  equals  the 
past  clickthrough  rate  of  the  same  document.  In  the  following,  we  start  with  experimental  setup 
in  Section  6.3.1,  and  proceed  with  detailed  results  under  two  evaluation  metrics  in  Section  6.3.2 
and  Section  6.3.3,  respectively. 

6.3.1  Experimental  Setup 

The  data  set  is  obtained  by  sampling  the  click  log  of  a  major  commercial  search  engine  during 
July  2008.  The  click  log  consists  of  the  query  string,  the  time-stamp,  document  impression  data 
(identities  of  top- 10  documents  in  the  first  page)  and  click  data  (whether  each  document  is  clicked 
or  not)  for  each  query  session.  Only  query  sessions  with  at  least  one  click  are  kept  for  better  data 
quality  since  we  find  from  additional  meta-information  that  clicks  on  ads,  query  suggestions  or 
other  elements  are  much  more  likely  to  appear  for  the  ignored  sessions  with  no  clicks.  It  also 
provides  clearer  comparison  of  performances  on  predicting  the  first  and  last  clicked  position. 
For  each  query,  we  sort  its  query  sessions  by  time-stamp  and  split  them  into  training  set  and  test 
set  of  equal  sizes.  The  number  of  query  sessions  in  the  training  set  is  4,804,633.  Then  these 
queries  are  categorized  according  to  the  query  frequency  in  the  test  set.  Top  0.16%  (178)  most 
frequently  searched  queries  (also  known  as  head  queries )  with  frequencies  greater  than  103'5  are 
not  included  in  the  subsequent  results  on  test  set  because  most  search  engines  already  do  very 
well  on  these  queries.  After  data  preprocessing,  the  test  set  consists  of  4,028,209  query  sessions 
for  110,630  distinct  queries  in  6  query  frequency  categories.  The  average  number  of  clicks  per 
query  session  is  1.139.  Statistics  of  the  test  set  are  summarized  in  Table  6.1.  Note  that  our  data 
set  includes  a  great  number  of  tail  queries  which  are  often  ignored  in  experiments  conducted  in 
previous  studies,  and  performances  over  all  query  sessions  are  not  dominated  by  head  queries  or 
a  particular  query  frequency  range. 

For  each  query,  document  relevance  estimates  for  DCM  are  computed  using  Eq.  6.7  on  the 
training  data,  and  for  ICM  it  equals  the  clickthrough  rate.  But  for  documents  which  appear 
very  few  times  in  the  training  set  and  which  appear  only  in  the  test  set,  document  relevance 
are  replaced  by  position  relevance,  which  are  computed  for  each  position  in  a  similar  way,  for 
deriving  log-likelihood  and  other  metrics  in  the  test  set.  This  has  a  smoothing  effect  on  the 
document  relevance,  and  leads  to  better  performance  for  the  evaluation  on  the  test  data.  Since  the 
additional  counts  that  we  need  to  keep  in  the  computation,  2 M  for  each  query,  is  usually  much 
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smaller  than  the  cost  saving  from  low-frequency  documents,  the  time  and  space  complexities 
can  also  be  reduced.  The  cut  off  of  minimum  number  of  impression  for  document  relevance 
computation  is  set  adaptively  according  to  the  query  frequency  category  (as  shown  in  Table  6.1) 
from  1  to  6.  Finally,  to  avoid  infinite  values  in  the  evaluation,  we  further  imposes  a  lower  bound 
of  0.01  and  an  upper  bound  of  0.99  on  the  learned  relevance  values  for  both  models  as  well  as 
user  behavior  parameters  in  DCM. 

Parsing  the  data  from  the  hard  disk  and  loading  them  into  main  memory  takes  around  45 
minutes.  All  the  subsequent  experiments  are  carried  out  in  a  server  machine,  with  2.67GHz  CPU 
cores,  32GB  memory,  Windows  Server  2008  64-bit  OS,  and  MATLAB  R2008a  installed.  The 
computational  time  for  training  DCM  is  no  more  than  7  minutes. 

6.3.2  Evaluation  based  on  Log-Likelihood 

Log-likelihood  (LL)  is  widely  used  in  the  machine  learning  community  to  measure  model  fitness. 
Here  it  indicates  a  soft-version  of  the  probability  that  clicks  from  the  model  prediction  over  top  10 
positions  are  consistent  with  the  ground  truth  over  the  test  set.  More  formally,  given  the  document 
impression  for  each  query  session  in  the  test  data,  LL  is  defined  as  the  log  probability  of  observed 
click  events  computed  under  the  trained  model.  A  larger  LL  indicates  better  performance,  and 
the  optimal  value  is  0.  The  improvement  of  LL  value  %  over  t2  is  computed  as  (exp^  —  £2)  — 
1)  x  100%.  We  report  average  LL  over  multiple  query  sessions  using  arithmetic  mean. 

Figure  6.3  presents  log-likelihood  curves  for  different  query  frequencies,  where  larger  log- 
likelihood  results  indicate  better  fit  on  the  test  data.  DCM  achieves  larger  performance  gain  for 
more  frequent  queries,  and  consistently  outperforms  ICM  by  over  10%  when  the  query  frequency 
is  over  100.  The  difference  is  only  less  significant  for  tail  queries  of  frequencies  less  than  10. 

The  DCM  curve  goes  below  ICM  for  queries  with  frequencies  less  than  10L5.  But  this  does 
not  imply  that  we  should  always  apply  ICM  to  model  these  queries.  Instead,  we  suggest  that 
lower  confidence  should  be  given  in  document  relevance  estimates  derived  from  click  models 
for  these  tail  queries.  We  could  still  record  counting  statistics  for  these  queries,  but  document 
relevance  estimates  should  be  reliable  when  new  data  flow  in  and  the  amount  of  training  data  is 
enough  to  obtain  a  good  fit. 

6.3.3  Evaluation  based  on  Position-Bias  Plots 

Click  position-bias  could  be  easily  visualized  by  drawing  a  curve  for  probabilities  of  clicks  over 
the  top- 10  positions  based  on  the  test  data.  And  we  compare  the  derived  click  probabilities  from 
both  DCM  and  ICM  with  the  ground  truth  in  Figure  6.4.  DCM  matches  these  probabilities  very 
well  at  the  top  5  positions.  The  higher  tail  of  empirical  curves  is  probably  due  to  user  scrolling 
behaviors,  especially  for  informational  queries  which  have  a  higher  click  through  rate.  And  we 
suspect  that  users  may  examine  documents  in  a  different  fashion  when  they  scroll  to  the  bottom 
of  the  search  result  page,  so  that  the  10th  position  receives  even  more  last  clicks  than  the  two 
above.  However,  they  contribute  to  a  fairly  small  fraction  of  overall  results:  clicks  after  position 
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Figure  6.3:  Log-likelihood  per  query  session  on  the  test  data  for  different  query  frequencies. 
The  overall  average  for  DCM  is  -1.327,  compared  with  -1.401  for  ICM  which  reflects  a  7% 
improvement. 

6  represent  only  6.1%  of  the  total  number.  ICM  tends  to  over-estimate  clicks  in  lower  positions, 
given  the  bias  that  it  assumes  every  position  is  always  examined. 

A  unique  property  of  DCM  is  that  examination  probabilities  could  be  computed  for  each 
query  session  and  they  are  aggregated  together  to  provide  a  hint  on  user  attention  over  different 
positions,  which  corresponds  to  the  dashed  curve  in  Figure  6.4.  The  first  position  is  always 
examined  from  the  modeling  assumption,  followed  by  a  geometrically  decreasing  pattern  after 
position  2.  Compared  with  the  DCM  click  curve,  the  gap  between  them  reflects  the  conditional 
click  probabilities  for  each  position,  which  suggests  larger  probabilities  for  both  top  and  bottom 
positions.  Note  that  both  curves  go  below  the  empirical  click  for  the  last  position,  and  this  bias 
is  attributed  to  user  behaviors  beyond  the  modeling  assumption  as  discussed  previously. 

Figure  6.5  displays  detailed  DCM  examination  probabilities  for  different  query  frequencies. 
All  of  them  share  similar  decreasing  pattern  but  differ  in  absolute  values.  The  trend  is  that  less 
frequent  queries  tend  to  be  examined  in  greater  depth,  and  we  also  observe  more  clicks  per  query 
session  in  the  click  log  for  them. 

6.3.4  Predicting  First  and  Last  Clicks 

We  now  focus  on  clicks  and  test  whether  samples  generated  from  ICM  and  DCM  provide  good 
match  of  first  and  last  clicked  position  compared  with  the  empirical  data.  Given  each  query 
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Figure  6.4:  Click  probabilities  for  different  positions  summarized  from  ICM/DCM  samples  as 
well  as  test  data,  and  examine  probabilities  implied  by  DCM.  The  click  distribution  implied  by 
DCM  matches  the  ground  truth  closely. 

session  in  the  test  set,  we  use  the  document  relevance  learned  from  the  training  set  to  determine 
the  click  probability.  For  ICM,  clicks  are  sampled  for  each  position  independently,  whereas  for 
DCM,  sampling  starts  from  the  top  ranked  document  and  ends  at  either  the  first  non-examined 
position  or  the  last  (10th)  position.  For  both  models,  we  collect  100  samples  with  at  least  one 
click,  then  first  and  last  clicked  position  are  identified  from  the  simulated  click  data  and  compared 
with  the  ground  truth  to  compute  RMS  errors.  This  is  the  most  time-consuming  part  in  the  model 
evaluation  experiments  and  takes  around  one  hour  to  finish.  To  reflect  the  inherent  randomness 
in  user  click  behavior,  we  also  compute  for  each  query  the  standard  deviation  of  first  and  last 
clicked  position  and  take  a  weighted  mean  over  different  queries  to  approximate  the  lower  bound 
of  RMS  error.  This  corresponds  to  the  “optimal”  curves  in  Figure  6.6.  We  expect  a  model 
that  gives  consistently  best  fit  of  click  data  would  have  the  smallest  margin  with  respect  to  the 
optimal  error,  and  this  margin  also  reflects  the  robustness  of  model  prediction  since  the  RMS 
error  metric  takes  account  of  both  bias  and  variance  in  prediction.  Finally,  we  aggregate  results 
over  all  queries  and  compare  the  distribution  of  first  and  last  clicks  from  two  click  models  with  the 
empirical  distribution  of  the  test  data,  which  corresponds  to  the  “empirical”  curves  in  Figure  6.7. 

RMS  errors  for  ICM  and  DCM  are  close  for  first  clicked  position  because  their  model  as¬ 
sumptions  are  the  same  until  the  first  click.  Predicting  last  clicked  position  turns  out  to  be  a 
more  difficult  task  as  demonstrated  by  higher  error  curves  in  Figure  6.6(b)  than  6.6(a).  With 
a  position-dependent  modeling  assumption,  DCM  outputs  more  reasonable  last  click  estimates 
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Figure  6.5:  Examine  probabilities  implied  by  DCM  for  different  query  frequencies.  Queries  are 
grouped  into  6  frequency  ranges  similarly  as  in  Table  6.1.  Darker  and  lower  curves  correspond 
to  more  frequent  queries. 


than  ICM,  reducing  the  RMS  error  gap  from  the  optimal  curve  by  around  30%. 

Figure  6.7  illustrates  generally  slower  than  geometric  decrease  with  the  position  for  the  em¬ 
pirical  probabilities  of  both  first  and  last  clicks.  DCM  matches  these  probabilities  very  well  at  the 
top  5  positions.  The  higher  tail  of  empirical  curves  is  probably  due  to  user  scrolling  behaviors, 
especially  for  informational  queries  which  have  a  higher  click  through  rate.  And  we  suspect  that 
users  may  examine  documents  in  a  different  fashion  when  they  scroll  to  the  bottom  of  the  search 
result  page,  so  that  the  10th  position  receives  even  more  last  clicks  than  the  two  above.  However, 
they  contribute  to  a  fairly  small  fraction  of  overall  results:  clicks  after  position  6  represent  only 
6.1%  of  the  total  number.  For  ICM  samples,  documents  that  appears  in  lower  positions  may 
receive  more  clicks  than  the  ground  truth  because  of  the  position-independent  assumption.  This 
results  in  over-estimation  of  last  click  probabilities  for  these  positions  in  Figure  6.7(b).  On  the 
other  hand,  the  document  relevance  estimates  in  ICM  is  smaller  than  those  in  DCM,  due  to  a 
larger  denominator  in  computing  the  empirical  probabilities.  This  under-estimation  has  a  more 
significant  effect  on  documents  which  usually  appear  in  lower  positions  and  after  the  last  clicked 
position.  Therefore,  the  first  click  probability  distribution  derived  from  ICM  has  a  lower  tail  than 
the  empirical  curve,  as  shown  in  Figure  6.7(a). 
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Figure  6.6:  Root-mean-square  (RMS)  errors  for  predicting  (a)  first  clicked  position  and  (b)  last 
clicked  position.  Results  are  averaged  over  100  samples  per  query  session. 
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Figure  6.7:  First  click  distribution  (a)  and  last  click  distribution  (b)  obtained  by  drawing  sam¬ 
ples  from  DCM  and  ICM  given  document  impression.  The  overall  first/last  click  distribution  of 
DCM  samples  matches  the  empirical  distribution  in  the  test  set  very  well,  particularly  for  top  5 
positions.  Results  are  averaged  over  100  samples  per  query  session. 
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6.4  Related  Work 


6.4.1  Click  Log  Analysis  and  Learning  to  Rank 

One  of  the  earliest  publications  on  large  scale  query  log  analysis  appeared  in  1999  [100],  which 
presented  interesting  statistics  as  well  as  a  simple  correlation  analysis  from  the  Alta  Vista  search 
engine.  Thereafter,  search  logs,  especially  the  click-through  data,  have  been  utilized  for  a  spec¬ 
trum  of  search-related  applications,  and  for  learning  to  rank  in  particular. 

Joachims  [60]  presented  a  pioneering  study  to  exploit  clickthrough  data  for  optimizing  the 
ranking  function  for  search  engines.  Pairwise  preference  feedback,  such  as  web  document  i  is 
more  relevant  as  web  document  j,  are  extracted  from  click  logs  and  used  to  train  a  ranking  sup¬ 
port  vector  machine  (ranking  SVM)  to  output  a  retrieval  function  most  concordant  with  these 
partial  orderings.  It  was  extended  by  Radlinski  et  al.  [92],  and  an  algorithm  was  proposed  to 
detect  a  sequence  of  reformulated  queries  from  the  same  user  to  leam  an  improved  function. 
Radlinski  et  al.  [93]  followed  this  line  of  study  for  optimizing  ranking  functions  but  takes  an  al¬ 
ternative  active-learning  approach  to  control  documents  presented  to  users  in  search  result  pages 
for  obtaining  more  helpful  feedback  as  the  next-round  training  data. 

Xue  et  al.  [115]  proposed  to  use  clickthrough  data  to  improve  graph-based  static  ranking 
algorithms.  Bilenko  et  al.  [15]  presented  a  novel  study  in  identifying  “search  trails”  from  user 
activity  logs  and  used  a  random-walk  based  algorithm  for  improved  retrieval  accuracy.  In  [23] 
Carterette  et  al.  proposed  a  logistic  model  for  relevance  prediction  using  scores  obtained  from 
human  judges. 

Clickthrough  data  could  be  also  combined  with  other  implicit  measures  or  browsing  data 
available  from  query  logs  to  improve  web  search.  The  studies  by  Agichtein  et  al.  proposed  to 
extract  a  spectrum  of  features  from  browsing  and  click  activities  as  well  as  textual  data  to  train 
a  better  ranker  [3]  and  estimate  user  preference  [4].  An  earlier  work  in  evaluating  these  implicit 
measures  appeared  in  [40].  Note  that  these  additional  information  may  not  be  able  to  be  collected 
everyday  due  to  the  huge  search  volume.  And  it  may  also  be  subject  to  high  level  of  noise,  e.g., 
web  page  dwelling  time  may  be  inaccurate  if  a  user  locks  the  screen  to  have  a  break  with  the 
browser  open. 

6.4.2  User  Behavior  Study  and  Click  Models 

A  central  task  in  utilizing  search  log  is  to  understand  and  model  user  search  and  browsing  be¬ 
haviors  and  click  decision  processes.  Joachims  and  his  collaborators  pioneered  this  direction 
by  presenting  a  series  of  studies  around  some  eye-tracking  experiments  [61,  62],  which  inspired 
a  series  of  models  that  interpret  user  behaviors  with  increasing  capacity,  namely,  the  cascade 
model  [33]  already  discussed,  and  the  user  browsing  model  proposed  by  Dupret  et  al  [37]. 

Following  the  proposal  of  DCM,  more  studies  have  borrowed  the  Bayesian  framework  and 
techniques  from  probabilistic  graphical  models  to  develop  more  sophisticated  alternatives  with 
dedicated  user  behavior  assumptions.  This  includes  the  click  chain  model  [50],  dynamic  Bayesian 
network  model  [27],  Bayesian  browsing  model  [72].  Despite  the  improvement  over  click  pre- 
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diction  and  extended  power  granted  by  the  Bayesian  modeling  [73],  they  share  a  few  simplify¬ 
ing  assumptions  such  as  homogenous  treatment  of  different  query  sessions  and  are  not  without 
limitations  in  a  number  of  important  aspects,  such  as  personalization,  query  reformulation,  tail 
queries  /  data  sparsity,  and  multimedia  search  results.  Most  recent  developments  and  generaliza¬ 
tions  in  this  area  has  started  to  address  these  challenges  from  different  perspectives. 

Matthijs  and  Radlinski  [77]  presented  a  novel  personalization  approach  for  building  user  pro¬ 
files  based  their  past  behavior  to  help  re-rank  future  search  results  in  better  alignments  with  their 
preferences.  Their  implementation  is  publicly  available  as  a  browser  add-on  with  detailed  doc¬ 
umentation  [76].  Dupret  and  Liao  [36]  proposed  the  session  utility  model  to  characterize  how 
users  satisfy  their  information  needs  in  a  series  of  query  submissions.  Their  analysis  showed 
favorable  results  for  informational  queries  and  other  type  of  queries  which  share  a  pattern  that 
involves  a  relatively  large  number  of  reformulation  and  resubmission.  Zhu  et  al.  [121]  introduced 
a  generalized  click  model  with  the  idea  that  relevance  in  the  click  model  could  be  a  function  of 
multiple  predicative  input  attributes  (features),  which  effectively  adds  a  regression-like  com¬ 
ponent.  The  attributes  learned  are  feature-specific  instead  of  query-specific,  and  the  model  is 
expected  to  perform  well  with  tail  queries. 


6.5  Conclusion  and  Discussion 

Web  search  click  logs  record  and  aggregate  important  implicit  feedbacks  from  user  browsing  and 
click  activities,  representing  one  of  the  most  extensive,  yet  indirect,  surveys  on  user  experience. 
They  are  valuable  resources  for  both  information  retrieval  researchers,  to  better  understand  hu¬ 
man  interaction  with  retrieval  results  and  calibrate  their  hypotheses  or  models,  and  web  search 
practitioners,  to  measure,  monitor  and  learn  to  improve  search  engine  performance.  A  key  chal¬ 
lenge  in  click  log  analysis  is  to  obtain  accurate,  efficient  interpretation  of  user  clicks,  despite  the 
fact  that  they  are  “informative  but  biased”  as  absolute  relevance  judgments,  as  demonstrated  in  a 
number  of  previous  studies. 

Click  models  usually  incorporate  a  statistical  depiction  of  user  interaction  with  web  search 
results  in  a  query  session,  by  specifying  probabilities  of  examination  and  clicks  at  different  po¬ 
sitions  and  how  they  depend  on  each  other.  They  provide  principled  solutions,  scaling  up  to 
terabyte/petabyte  scale  click  data,  to  inferring  user-perceived  relevance  of  web  documents,  and 
modeling  outputs  could  be  further  leveraged  in  various  search-related  applications  including 
search  engine  quality  evaluation  and  sponsored  search  auctions.  The  idea  and  principle  apply 
to  search  interfaces  for  multimedia  search  (better  known  as  federated  search)  as  well,  with  the 
introduction  of  additional  types  of  bias  over  user  examination  and  click  probabilities  reflecting 
different  presentation  styles  (e.g.,  images  and  videos). 

This  chapter  provides  a  self-contained  discussion  of  click  models  employing  the  dependent- 
click  model  as  the  running  example.  Trade-offs  in  model  design  and  choices  of  user  behavior 
assumptions  should  be  dependent  on  specific  application  settings.  DCM  serves  as  an  easy  and 
efficient  example  that  would  be  quite  convenient  for  fast-prototyping  and  generally  performs 
well  with  abundant  data.  A  brief  coverage  of  click  chain  model,  featuring  Bayesian  statistical 
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techniques  and  more  dedicated  user  model,  is  left  to  Chapter  A  in  the  appendix.  This  field  has 
attracted  a  lot  of  interests  from  academic  and  industrial  researchers  within  the  community  of 
Web  Search  and  Data  Mining  since  2009,  with  a  number  of  exciting  new  ideas  that  will  further 
push  forward  the  boundary  of  the  state-of-the-art  to  be  expected  next  year. 

Small  ideas  and  subtle  implementation  details  could  also  play  a  dramatic  role  in  addition  to 
novel  ideas  and  models  from  academic  research.  For  example,  tracking  and  logging  how  long 
users  spend  on  the  landing  page  after  a  click  may  provide  informative  signals  and  lead  to  a  differ¬ 
ent  treatment  for  very  short  clicks.  Also,  for  pages  that  are  “quickly  viewed  and  reloaded”  [39], 
these  short  impressions  could  well  be  eliminated  at  all. 
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Part  IV 
Conclusion 
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Chapter  7 

Concluding  Remarks 


Having  elaborated  on  the  motivation,  related  literatures,  algorithmic  details  and  experimental 
evaluation  for  each  study  in  the  preceding  five  chapters,  we  bring  this  document  to  a  close  by 
reiterating  major  contributions  under  the  theme  of  mining  and  querying,  with  a  sketch  of  future 
directions  that  comes  in  sequel. 


7.1  Mining  Multimedia  Data 

Chapter  2  presented  QMAS  in  the  context  of  satellite  image  analysis.  The  discussion  started  off 
with  a  dedicated  multi-scale  feature  extraction  procedure  from  image  tiles.  An  efficient  subspace 
clustering  algorithm  was  introduced  to  capture  the  similarity  between  tiles,  achieving  over  40 
times  speed-up  over  a  prevailing  implementation  of  approximate  nearest-neighbor  finding  with¬ 
out  any  expense  of  quality.  Low-labor  labeling  was  made  possible  by  constructing  a  three-layer 
graph  based  on  clustering  outputs,  and  executing  a  slight  variation  of  random  walk  with  restart 
algorithm.  It  provided  high  quality  labeling  results,  even  with  tiny  sets  of  pre-labeled  data  as  in¬ 
puts.  It  could  also  spot  top  representatives  and  outliers  and  offered  a  compact  summarization  of 
a  large  data  set.  The  implementation  was  also  employed  to  perform  a  set  of  practical  queries  over 
a  proprietary  data  set  by  domain  experts  and  it  yielded  quite  positive  results  -  correct  labeling  of 
objects  where  the  traditional  automated  target  recognition  (ATR)  approach  may  fail. 

Chapter  3  introduced  MultiAspectForensics ,  a  handy  tool  to  automatically  detect  and  visual¬ 
ize  novel  subgraph  patterns  within  a  local  community  of  nodes  in  a  heterogeneous  network ,  such 
as  a  set  of  vertices  that  form  a  dense  bipartite  graph  whose  edges  share  exactly  the  same  set  of 
attributes.  It  was  effective  even  if  such  patterns  exist  among  less-well  connected  nodes  which  are 
very  likely  to  be  ignored  by  many  extant  methods.  Empirical  results  exhibited  valuable  insights 
derived  from  pattern  discovered,  across  multiple  application  domains  such  as  network  traffic 
monitoring,  knowledge  networks,  and  bioinformatics.  These  successes  could  be  attributed  to  the 
fact  that  we  resorted  to  a  tensor-based  representation  to  facilitate  data  decomposition,  reached  a 
key  observation  leading  to  spike  patterns  in  histogram  plots,  and  revealed  typical  substructures 
reflecting  spectral  properties  of  heterogeneous  data. 

I  hasten  to  point  out  that  although  both  multimedia  data  mining  studies  briefed  hereinabove 
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purport  to  adopt  network  representation  of  the  multi-aspect  data  in  a  more  or  less  similar  way, 
their  approaches  are  not  alike.  In  the  former  scenario,  the  graph  consists  of  multiple  layers  of 
node,  each  of  which  is  attributed  to  one  aspect  of  the  data,  and  the  structural  information  is 
made  salient  by  the  inter-layer  links,  e.g.,  a  priori  curated  labeling  inputs  were  transformed  into 
edges  connecting  their  respective  nodes  in  the  image  layer  and  those  in  the  layer  of  annotation 
vocabulary.  The  labeling  problem  could  be  conveniently  translated  to  cross-modal  proximity 
query.  And  for  the  latter  piece,  the  heterogeneity  lies  on  the  edge  level,  i.e.,  edges  in  the  graph 
may  carry  one  or  more  attributes,  which  renders  tensor-based  representation  a  natural  solution 
and  spectral  analysis  rather  straightforward  to  carry  out.  Hence,  we  hope  that  discussion  in  this 
paragraph  could  shed  some  light  on  the  interdependence  among  the  mining  task,  the  choice  of 
data  representation,  and  subsequent  algorithmic  design. 


7.2  Querying  Multimedia  Data 

Chapter  4  described  CDEM ,  an  online  interface  to  assist  biological  researchers  to  perform  flex¬ 
ible  querying  and  exploration  over  a  large  database  which  consists  of  embryo  images,  image 
annotations,  as  well  as  genes  whose  expression  patterns  are  illustrated  by  these  images.  The  data 
representation  scheme  is  closely  aligned  with  that  in  Chapter  2. 

Chapter  5  provided  a  Bayesian  approach  to  inference  and  learning  with  the  exponential  fam¬ 
ily  harmonium  (EFH)  models  and  their  variants  for  latent  semantic  projection  of  multimedia 
documents  for  subsequent  data  mining  tasks  such  as  classification  and  retrieval.  The  technique 
differed  from  previous  chapters  in  that  the  input  data  are  recognized  as  snapshots,  or  observa¬ 
tions,  of  abstracted  random  variables  following  pre-assumed  probability  distributions,  probably 
with  parameters  yet  to  be  estimated.  The  data  heterogeneity  is  accounted  in  the  model  setup, 
specifically  in  the  selection  of  appropriate  distributions,  e.g.,  binary  word  occurrence  feature 
corresponds  to  a  Bernoulli  distribution,  word  counts  may  follow  a  Poisson  one,  whereas  image 
features  could  be  fit  with  Gaussian. 

Chapter  6  served  as  a  short  tutorial  of  click  models,  with  DCM  as  an  introductory  example. 
Click  models  have  attracted  a  lot  interests  from  academic  researchers  and  industrial  practition¬ 
ers  since  just  a  few  years  ago.  The  studies  highlighted  in  this  manuscript  are  among  the  first  few 
papers  on  this  topic  and  represented  state-of-the-art  at  the  time  of  publication.  They  provide  prin¬ 
cipled  solutions  to  obtain  accurate  interpretation  of  user  clicks  as  they  interact  with  Web  search 
results,  to  measure,  monitor  and  learn  to  improve  search  engine  performance .  Our  proposed 
models  represented  state-of-the-art  along  this  line  of  research,  scaled  up  to  terabyte/petabyte  size 
of  click  data,  and  have  been  further  leveraged  in  various  search-related  applications  including 
search  engine  quality  evaluation  [49]  and  post-rank  reordering  [73].  The  idea  and  principle  are 
ready  to  be  applied  to  search  interfaces  for  multimedia  databases  as  well. 

An  interesting  observation  is  that  compared  with  studies  covered  in  the  previous  three  chap¬ 
ters  takes  a  quite  user-centric  perspective,  while  those  on  the  topic  of  mining  addresses  the  need 
of  data  owner  to  take  advantage  of  their  assets.  Moreover,  it  is  worth  noting  that  exact  infer¬ 
ence  for  most  probabilistic  graphical  models  in  practice,  including  those  proposed  in  Chapters  5 
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and  6,  is  intractable,  and  when  there  are  multiple  approximation  alternatives,  specific  applica¬ 
tion  requirements  usually  dictate  the  trade-off  between  quality  and  scalability.  For  instance,  Web 
search  click  models  are  desired  to  be  both  single-pass  and  incremental,  to  avoid  the  potentially 
high  cost  to  retain  and  revisit  old  data,  and  adapt  to  new  trends  without  much  effort. 


7.3  Discussion  and  Future  Directions 

Studies  presented  in  this  thesis  represents  a  miniature  of  diverse  tasks,  goals,  design  constraints, 
algorithms,  heuristics,  and  experimental  methods  belated  to  multimedia  data  analysis  in  prac¬ 
tice.  We  sought  to  achieve  an  appropriate  trade-off  between  the  two  aspects  of  performance  - 
quality  and  scalability.  It  is  not  uncommon  in  the  design  of  a  real  mining  and  querying  system, 
different  components  will  be  constrained  in  dissimilar  ways.  For  example,  expensive  tensor  de¬ 
composition  algorithms  could  be  applied  to  build  an  offline  index  for  a  recommendation  system, 
whereas  online  ranking  adjustments  should  be  carried  out  in  a  much  more  efficient  fashion.  To 
rank  thousands  or  millions  of  candidates  for  a  given  query,  cheap  and  quick  heuristics  could  be 
implemented  at  the  indexing  layer,  whereas  the  final  ranking  layer,  which  receives  a  small  subset 
of  more  relevant  candidates,  could  afford  more  dedicated  models. 

The  constantly  evolving  scale,  genre  and  ubiquity  of  multimedia  data  brings  up  challenges 
and  opportunities  into  the  fast  developing  field  of  research  and  practice  of  multimedia  mining 
and  querying,  with  a  quite  incomplete  list  of  questions  highlighted  as  follows: 

•  Can  we  better  leverage  the  distributed  computing  frameworks  to  grant  greater  scalability 
to  existing  solutions?  The  PEGASUS  system  [88]  serves  as  an  excellent  example  along 
this  direction. 

•  Can  we  make  available  generic  implementations  of  state-of-the-art  algorithms  to  stimulate 
cross-disciplinary  impacts?  This  may  seem  tedious  at  first  but  would  probably  lead  to 
long-term  benefits. 

•  How  to  evaluate  the  effectiveness  of  a  mining  algorithm  for  novel  pattern  discovery  beyond 
validating  anecdotal  evidence?  In  particular,  in  what  case  is  crowd-sourcing  a  reliable 
alternative  when  domain  expertise  is  inaccessible  or  prohibitive? 

Recall  the  opening  example  in  this  thesis  illustrating  the  emerging  trend  of  mobile  and  social 
in  the  beginning  of  the  current  decade,  the  quest  of  better  vehicles  to  boosting  the  bandwidth 
and  quality  of  information  flow  via  computation,  in  particular,  improved  recommendation  and 
ranking  of  multimedia  units  across  different  platforms  and  contexts,  would  always  be  exciting 
problems  to  be  working  on,  which  I’d  like  to  pursue  as  part  of  my  future  career. 
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Appendix  A 

Click  Chain  Model 


A.l  Introduction 

Web  click  are  informative  but  biased.  In  order  to  neutralize  various  biases,  numerous  click 
models  have  been  developed  as  a  principled  approach  to  inferring  user-perceived  relevance  of 
web  documents.  Furthermore,  as  search  logs  can  easily  run  over  terabytes  into  petabytes  and 
usually  comes  in  a  data  stream  on  a  daily  basis,  we  would  ideally  expect  click  models  to  be 
amenable  to  click  data  streams,  which  essentially  means  scalability  and  the  capability  to  be 
incrementally  updated. 

In  this  part  of  the  appendix,  we  present  the  Click  Chain  Model  (CCM),  which  features  a  solid 
Bayesian  foundation  and  great  scalability.  In  particular,  document  relevance  are  represented  as 
random  variables  in  the  model,  and  a  closed-form  solution  to  the  relevance  posterior  is  derived 
from  the  proposed  approximate  inference  scheme.  Based  on  a  data  set  containing  8.8  million 
query  sessions,  we  show  that  CCM  consistently  outperforms  two  state-of-the-art  competitors  in 
a  number  of  metrics,  with  over  9.7%  better  log-likelihood,  over  6.2%  better  click  perplexity,  and 
much  more  robust  (up  to  30%)  prediction  of  the  last  clicked  position. 


A.2  Background 

A  web  search  user  initializes  a  query  session  by  submitting  a  query  to  the  search  engine.  We 
regard  re- submissions  and  reformulations  of  the  same  query  as  distinct  query  sessions.  We  use 
document  impression  to  refer  to  the  web  documents  (or  URLs)  presented  in  the  first  result  page, 
and  discard  other  page  elements  such  as  sponsored  ads  and  related  search.  The  document  im¬ 
pression  can  be  represented  as  d  =  {di, . . . ,  o?m}  (usually  M  =  10),  where  di  is  an  index  into 
a  set  of  documents  for  the  query,  i.e.,  d\  denotes  the  document  shown  on  the  top  position.  A 
document  is  in  a  higher  position  (or  rank)  if  it  appears  before  those  in  lower  positions. 

Examination  and  click  events  for  impressed  documents  are  treated  in  a  probabilistic  way.  For 
a  given  query  session,  we  use  binary  random  variables  Et  and  C%  to  represent  the  examination 
event  and  the  click  event  at  position  i  respectively.  Therefore,  P(Ei  =  1)  and  P(Ci  =  1)  are  the 


89 


examination  probability  and  the  click  probability  at  the  same  position. 

The  examination  hypothesis  [94],  cascade  hypothesis  and  the  cascade  model  [33]  have  al¬ 
ready  been  introduced  in  Section  6.1.2  where  the  dependent  click  model  [51]  is  presented. 
The  remainder  of  this  section  is  devoted  to  another  click  model  known  as  the  user  browsing 
model  [37]. 

The  user  browsing  model,  or  UBM,  is  also  based  on  the  examination  hypothesis,  but  does  not 
follow  the  cascade  hypothesis.  Instead,  it  assumes  that  the  examination  probability  Et  depends 
on  both  i  and  the  previous  clicked  position  1,  =  argma xt<i{Ci  =  1}: 

P(Ei  —  l\Ci, . . . ,  Ci-i)  —  7^.  (A.l) 

If  there  is  no  click  before  i,  then  lt  =  0,  therefore  0  <  lt  <  i  <  M.  The  total  number  of  user 
behavior  parameters  is  M(M  +  1) / 2,  which  are  again  shared  by  all  query  sessions. 

Under  UBM,  the  log-likelihood  for  each  query  session  is 

M 

Kl)  =  (' Ci  log  rdi  +  Ci  log  Ti.ii  +  (!  -  ci)  iogl1  -  r ATUi))  (A.2) 

i= 1 

The  coupling  of  relevance  r  and  parameter  7  introduced  in  the  second  term  makes  exact  com¬ 
putation  intractable.  The  algorithm  could  alternative  be  carried  out  in  an  iterative,  coordinate- 
ascent  fashion.  However,  we  found  that  the  fix-point  equation  update  proposed  in  [37]  does  not 
have  convergence  guarantee  and  is  sub-optimal  in  certain  cases.  Instead,  we  designed  the  follow¬ 
ing  update  scheme  which  takes  a  few  dozen  iterations  before  achieving  convergence  according 
to  our  evaluation  over  a  real-world  click  log  data  set. 

Given  a  query  for  each  document  d,  we  keep  its  count  of  click  K d  and  non-click  L,i,  j  where 
1  <  j  <  M(M  +  1) / 2,  and  they  map  one-to-one  with  all  possible  7  indices,  and  1  <  d  <  D 
maps  one-to-one  to  all  query-document  pair.  Similarly,  for  each  7 j,  we  keep  its  count  of  click  K;] . 
Then  given  the  initial  value  of  7  =  70,  optimization  of  relevance  can  be  carried  out  iteratively, 

M(M+ 1)/2 

r^+1  =  arg  max  l  Kd  log  r  +  V'  Ld  j  log(l  -  ry-)  \ 

0<r<l  l  — '  J  ) 

3= 1 

D 

7j+1  =  arg  max  j/^  logy  +  ^  LdJ  log(l  -  r*+1y)} 

7  d= 1 

for  all  possible  d  and  j  respectively.  The  “arg-max”  operation  can  be  carried  out  by  evaluating 
the  objective  function  for  a  sequential  scan  of  r  or  7  values  between  0  and  1.  And  we  suggest  a 
granularity  of  0.01  for  both  scans. 

A.3  Click  Chain  Model 

The  CCM  model  is  based  on  the  generative  process  of  user  interaction  illustrated  in  the  form 
of  a  flowchart  in  Figure  A.l.  The  user  initiates  the  examination  of  the  search  result  in  each 


(A.3) 

(A.4) 
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Done 

Figure  A.l:  Our  proposed  user  model  of  CCM,  in  which  Rri  is  the  relevance  variable  of  di  at 
position  i,  and  a’s  form  the  set  of  user  behavior  parameters. 


query  session  from  the  top  ranked  document.  At  each  position  i,  the  user  can  choose  to  click 
or  skip  the  document  di  according  to  the  perceived  relevance  Rt.  Since  R,  is  a  random  variable 
under  CCM,  we  can  draw  a  sample  from  the  distribution  of  Rt  to  determine  the  chance  of  click. 
Whatever  the  outcome  for  this  click  decision,  the  user  can  choose  to  continue  the  examination 
or  abandon  the  current  query  session.  The  probability  of  continuing  to  examine  di+i  depends  on 
the  click  decision  at  position  i.  Specifically,  if  the  user  skips  di,  this  probability  is  an  (unknown) 
constant  a3;  on  the  other  hand,  if  the  user  clicks  di,  the  probability  to  examine  dl+\  depends  on 
the  user-perceived  relevance  R.t  and  range  between  a3  and  a2- 

CCM  shares  the  following  assumptions  with  the  cascade  model  and  DCM:  (1)  users  are  ho¬ 
mogeneous:  their  information  needs  are  similar  given  the  same  query;  (2)  decoupled  examination 
and  click  events:  the  click  probability  is  solely  determined  by  the  examination  probability  and 
the  document  relevance  at  a  given  position;  (3)  cascade  examination:  examination  is  in  strict 
sequential  order  without  breaks. 

CCM  distinguishes  itself  from  other  click  models  by  representing  document  relevance  as 
random  variables  and  performing  (approximate)  Bayesian  inference  to  infer  their  posterior  dis¬ 
tributions.  Model  fitting  of  CCM  includes  both  inferring  the  posterior  distribution  of  Ri  and 
estimating  user-behavior  parameters  a.  =  {a3,  a2,  0.3}.  The  posterior  distribution  could  be  fur¬ 
ther  leveraged  for  applications  such  as  automated  ranking  alterations  because  it  is  possible  to 
derive  confidence  measures  and  other  useful  features  using  standard  statistical  techniques  [73]. 

The  graphical  model  of  CCM  for  one  query  session  is  shown  in  Figure  A. 2.  There  are  three 
layers  of  random  variables:  for  each  position  i,  Ei  and  Ct  denote  examination  and  click  events 
as  usual,  whereas  R.t  is  the  user-perceived  relevance  of  di.  The  click  layer  is  fully  observed  given 
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Figure  A. 2:  The  graphical  model  representation  of  CCM.  Shaded  nodes  are  observed  click  vari¬ 
ables. 


the  click  log.  CCM  is  named  after  the  chain  structure  of  variables  representing  the  sequential 
examination  and  clicks  through  every  position  in  the  search  result. 


The  following  conditional  probabilities  are  defined  for  Figure  A. 2  according  to  the  modeling 
assumption: 


P{Ci  =  l\Ei  =  0)  =  0 
P(Ci  =  1| Ei  =  1 ,  Ri)  =  Ri 

P(Ei+i  =  l\Ei  =  0)  =  0  (A. 5) 

P{Pi+ 1  =  1|  Ei  =  1)  Ci  =  0)  =  Oil 

P(Ei- i-i  —  1|  Ei  =  1,  =  1,  Ri)  =  012(1  —  Ri)  +  oi^Ri 


To  complete  the  model  specification  we  let  P(E\  =  1)  =  1  and  impose  independent  uniform 
priors  over  [0,1]  for  every  document  relevance  variable  R,t,  i. e. ,  j>(R,)  =  1  for  0  <  R,  < 
1.  Note  that  in  this  model  specification,  we  are  not  putting  a  limit  on  the  length  of  the  chain 
structure.  Instead  we  allow  the  chain  to  go  infinitely.  We  will  discuss,  in  the  next  section,  that 
this  choice  simplifies  the  inference  algorithm,  provides  an  order  of  magnitude  savings  in  space, 
and  sacrifices  little  performance  since  the  chance  of  examination  diminishes  exponentially  after 
the  last  click.  An  alternative  is  to  truncate  the  click  chain  to  a  finite  length  of  M.  The  single-pass 
inference  and  learning  algorithms  detailed  hereinafter  for  the  (standard)  CCM  could  be  adapted 
to  this  truncated  version,  to  offer  more  accurate  characterization  of  the  posterior  when  the  last 
clicked  position  is  close  to  M. 
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A.4  Algorithms 


This  section  describes  inference  and  learning  algorithms  for  the  CCM.  We  start  with  the  al¬ 
gorithm  for  computing  the  posterior  distribution  over  the  relevance  of  each  document  in  Sec¬ 
tions  A.4.1  and  A. 4. 2.  Then  we  describe  how  to  estimate  the  at  parameters  in  Section  A. 4. 3.  We 
will  also  provide  a  brief  discussion  in  Section  A.4. 2  to  show  that  by  keeping  appropriate  counts 
as  sufficient  statistics,  it  is  straightforward  to  extend  these  algorithms  to  incrementally  update  the 
relevance  posteriors  and  at  parameters  with  newly  available  data. 

Given  the  click  data  C  =  {C1, . . . ,  Cu}  from  U  query  sessions  for  the  query  of  interest,  we 
want  to  compute  the  posterior  p(Ri\C)  for  the  relevance  of  a  document  i.  For  a  single  query 
session,  this  posterior  can  be  efficiently  computed  due  to  the  chain  structure  of  the  graphical 
model  (Figure  A. 2),  and  the  distribution  function  is  polynomial  with  respect  to  Ri.  However, 
given  multiple  query  sessions,  exact  inference  becomes  intractable  due  to  sharing  of  relevance 
variables  between  query  sessions.  For  example,  if  documents  i  and  j  may  both  appear  in  two 
sessions  at  different  positions,  their  posteriors  are  generally  dependent  on  each  other.  To  carry  out 
approximate  inference,  we  could  resort  to  off-the-shelf  iterative  algorithms  such  as  expectation 
propagation  [79].  However,  to  scale  the  model  to  terabyte  data,  we  propose  a  faster  method  that 
requires  only  a  single  pass. 

The  approximation  is  that,  when  computing  the  posterior  for  Ri,  we  assume  that  the  clicks  in 
query  sessions  are  conditionally  independent  given  R, : 

u 

p(Ri\C )  ~  (constant)  x  p(Ri)  P(Cu\Ri).  (A. 6) 

1 

Since  the  prior  p(Ri)  is  given,  we  only  need  to  compute  P  ( Cu  |  Rt)  for  each  query  session  u,  up 
to  a  constant  w.r.t.  R.,  ,  in  order  to  obtain  an  un-normalized  version  of  the  posterior.  Section  A.4. 1 
will  elaborate  on  the  computation  of  this  conditional  probability  P  ( Cu|  Ri),  and  the  superscript 
u  is  omitted  in  the  following  as  we  focus  on  a  single  query  session. 

A.4.1  Deriving  the  Conditional  Probabilities 

Before  diving  into  the  details  of  deriving  P(C  R, )  for  a  particular  session,  we  first  highlight  the 
following  three  properties  of  CCM,  which  will  greatly  simplify  the  variable  elimination  proce¬ 
dure  as  we  take  sum  or  integration  over  all  hidden  variables  other  than  R, : 

1.  If  Cj  =  1  for  some  j,  then  Vi  <  j,  Ei  =  1. 

2.  If  Ej  =  0  for  some  j,  then  Vi  >  j,  Ei  =  0,  6',  =  0. 

3.  If  Ei  =  1,  Ci  =  0,  then  P(Ei+i\Ei,  Ci,  Ri)  =  afi+1(l  —  Oi\)l~Ei+1  does  not  depend  on  Ri. 

Property  (1)  states  that  every  document  before  the  last  clicked  position  is  examined,  so  for  these 
documents,  we  only  need  take  care  of  different  values  of  random  variables  within  its  Markov 
blanket  in  Figure  A. 2,  which  consists  of  Ci,  Et  and  Ei+1.  Property  (2)  is  a  corollary  from  the 
cascade  hypothesis,  and  it  reduces  the  cost  of  eliminating  E  variables  from  exponential  time  to 
linear  time  by  using  branch-and-conquer.  Property  (3)  is  derived  from  the  model  specification 
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Figure  A. 3:  Different  cases  for  computing  P( C  R, )  up  to  a  constant  where  l  indicates  the  last 
clicked  position.  Darker  nodes  in  the  figure  above  represent  clicks  at  these  positions,  whereas 
lighter  nodes  represent  skips. 

(Eq.  A. 5),  and  it  enables  re-arrangement  of  the  sum  operation  over  E  and  R  variables  to  minimize 
the  computational  effort. 

From  property  1,  we  know  that  the  last  clicked  position  l  =  argmax1<i<M{C'i  =  1}  plays  an 
important  role.  Figure  A. 3  lists  the  complete  results  separated  into  two  categories,  depending  on 
whether  there  is  any  click  in  the  current  query  session. 

Derivation  for  each  case  is  presented  in  the  following: 

Case  1:  i  <  l,  Ci  =  0 

By  property  1  and  property  3,  =  1,  Ei+X  =  1  and  P(Ei+1  =  1\E,  =  1,  C,  =  0,  R, )  = 

a\  does  not  depend  on  Ri.  Since  any  constant  w.r.t.  Ri  can  be  ignored,  we  have 

P{ C  | Ri)  oc  P(Ci  =  0| Et  =  1, Ri)  =  1  -Ri  (A.7) 

Case  2:  i  <  l,  Ci  =  1 

By  property  1,  Ei  =  1,  Ei+ 1  =  1,  the  Markov  blanket  of  Ri  consists  of  Ci:  Ei  and  Ei+i. 

P(C\ R,i)  oc  P{Ci  =  l\Ei  =  1 ,  Ri)  P(Ei+ 1  =  l\Ei  —  1,  C,-  —  l,Ri) 

oc  Ri(  1  -  (1  -  a3/a2)Ri )  (A.8) 
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Case  3 :  i  =  l 

By  property  1,  the  Markov  blanket  of  R,  does  not  contain  any  variable  before  position  i, 
and  we  also  know  that  C,  —  E,  =  \  and  Vj  >  i.  Cj  =  0.  We  need  to  take  sum/integration 
over  all  the  Ej,  Rj  variables  where  3  >  i  and  it  can  be  performed  as  follows: 


P(C  |.Rj)  oc  P(Ci  —  l\Ei  —  1,  Ri)  ■  E  / . II 

{Ej\j>i}  1  j>i 

(pfalEj-r,  Cj- 1,  Rj- 1)  •  p(Rj)  ■  P(Cj\Ej ,  Rj)^J 

—  Ri  •  =  0| Ei  —  1,  Ci  —  1,  Ri)  ■  1 

+  P(E*i+ 1  —  1| Ei  —  1)  Ci  —  1,  Ri)  ■ 

p{Ri+ i)  P(Ci+ 1  =  0|-Ej+i  =  1,  Ri+\)dRi+\ 

P(Ei+ 2  =  0|Ei+i  =  1, C'+i  =  0)  •  1 

+  -P(f?j_|_2  =  l|£?i+i  =  l,Ci+i  =  0)  • 

(“1 

P(Ri+ 2)  P(Ci+2  =  0|f?j_|_2  =  1, -Rj+2)d-Rj+2 


—  -R?  ^(1  —  0:2(1  —  Ri)  —  a^Ri)  +  (0:2(1  —  Ri)  +  oc^Ri)  • 

i  •  ((l-oi)  +  oi-i  •(•••))) 

OC  Ri  (1+  a2~as  Rj]  (A. 9) 

\  2  —  ai  —  0L2  ) 


Case  4 :  i  >  l 

Now  the  result  will  be  a  function  of  k  —  i  —  l:  the  offset  of  the  document  from  the  last 
clicked  position.  The  branch-and-conquer  approach  we  take  to  sum  over  variables  in  the 
Markov  blanket  of  R,  is  similar  to  Case  3.  The  major  difference  in  the  equation  below  is 
that  we  are  integrating  the  outmost  Ri  while  leaving  the  R,  inside  without  the  integration, 
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and  we  will  replace  the  dummy  variable  Ri  by  R  for  easier  reading: 


P{ C  | Ri 


f  f0  R  ^(1  —  0:2(1  —  R)  —  03)/?)  +  (0:2(1  —  R)  +  03 )R)  ■  (1  —  Ri)  ■  11L ai/2) 


dR 


oc  < 


(*  =  1) 


fo  R  ((1  -  o2(l  -  R)  -  o3 )R)  +  (0:2(1  ~R)  +  a3)R)  •  E-ZoM? V 


+  (o1/2)fc-1(l 


dR 


(k>  1) 


—  f  P^(l  —  o2(l  —  R)  —  03 )R)  +  (o2(l  —  R)  +  0:3 )R)' 


(1  -  ai)(l  +  {aifoy-1  -  2(q1/2)fc~1fli) 

2  —  on 


dR 


oc  1  — 


1+  fr3aA7^2^(2/o1)fc-1 


i?,: 


(A.  10) 


(1— ai)(a2+2«3) 


Case  5:  (No  click) 

When  /'  =  1,  we  know  E\  =  1,  therefore  P(E‘2\El  =  1,  C\  =  0)  does  not  depend  on  R \ . 
We  have 

P{ C  | R^  oc  P(Ci  =  0|Pi  =  l,Pi)  =  1  -Rl  (A.  11) 


When  i  >  1  the  derivation  is  similar  to  Case  4  and  is  much  simpler  since  there  is  no  o2,  o3 
term: 


P(C  |.Rj)  oc  [  {l  ~R)dR[(l-a1)-:V( 

Jo  \  j=o 


(a1/2y  +  (a1/2y-'2a1(l-Rl) 


1  —  Oi 

1  -  oi/2 


oc  (1  -  O!)1  n  (Q'l/2)*  1  +  2(o1/2)i_1(l  -  Ri)-  1  ai 


1  -  oi/2 


1  —  o:i/2 


oc  1 


1  +  (2/ 01) 


—  Ri 


(A. 12) 


Eq.  A.  1 1  can  be  treated  as  a  special  case  of  Eq.  A.  12  when  i  —  1. 

Both  case  4  and  case  5  need  to  take  sum  over  latent  variables  after  the  current  position  i, 
and  results  depend  on  the  distance  from  the  last  clicked  position.  Therefore  the  total  number  of 
distinct  terms  obtained  in  these  5  cases  when  1  <  i  <  M  are  1  +  1  +  1  +  (M— 1)+M  =  2M+2.  If 
we  impose  a  finite  chain  length  M  on  CCM  and  let  P(EM+1)  =  0  in  Figure  A. 2,  then  the  number 
of  distinct  results  would  be  M2+ 2,  which  is  an  order  of  magnitude  higher  than  the  current  design, 
and  further  increases  the  cost  of  subsequent  steps  of  inference  and  parameter  estimation. 


A.4.2  Computing  the  Posterior 

All  the  conditional  probabilities  in  Figure  A. 3  for  a  single  query  session  can  be  written  in  the  form 
of  Rfi(  1  —  (3jRi)  where  fi3  is  a  case-specific  coefficient  whose  value  depend  on  user  behavior 
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Algorithm  3  :  Computing  the  Un-normalized  Posterior 

Input:  Click  Data  C  ( M  x  U  matrix); 

CiU  =  1  if  user  u  clicks  the  document  at  position  i 
Impression  Data  V  ( M  x  U  matrix); 

Dlu  —  d  if  the  document  d  is  impressed  at  position  i 
to  user  u,  1  <  d  <  N 
Parameters  a 

Output:  Coefficients  (3  ((2 M  +  2) -dim  vector) 

Exponents  T  (N  x  (2 M  +  2)  matrix) 

Compute  (3  using  the  results  in  Figure  A. 3. 

Initialize  T  by  setting  all  its  elements  to  0. 

for  1  <  u  <  U 

for  1  <  i  <  M 

Identify  the  linear  factors  using  the  click  and  impression  data, 
let  f3j  be  the  corresponding  coefficient  for  the  current  position, 

Tfj  j  —  Tfj  j  1. 

_ J-^iu  lj _ ^iu  ij  _ 

Time  Complexity:  O(MU). 

Extra  Space:  O(MN). 

(Usually  U  >  N  »  M  =  10.) 


parameters  a.  Let  M  be  the  number  of  web  documents  in  the  first  result  page  and  it  usually 
equals  10.  There  are  at  most  (2 M  +  2)  different  coefficients  from  the  5  cases  as  discussed  above. 
From  Eq.  A. 6,  we  know  that  the  un-normalized  posterior  p(Ri\C)  is  a  polynomial  function  of 
Ri  which  consists  of  at  most  (2 M  +  3)  distinct  linear  factors  (including  R:t  itself),  so  given  user 
behavior  parameters  ct,  we  only  need  to  record  (2M  +  2)  exponents  as  sufficient  statistics  to  fully 
characterize  the  posterior  distribution  for  every  query-document  pair.  The  detailed  algorithm  for 
parsing  through  the  click  log  and  updating  exponents  is  listed  in  Algorithm  3.  Given  output 
exponents  T  as  well  as  coefficients  (3  from  Algorithm  3,  the  un-normalized  relevance  posterior 
of  a  document  indexed  by  d  is 


2M+2 

Pd(r)  oc  rTd’2+Td’3  ( 1  —  j8jr)TdJ  (A. 13) 

j= 1 

Furthermore,  when  new  click  data  becomes  available,  we  can  run  Algorithm  3  to  update 
exponents  stored  in  T  by  incrementing  the  counts  for  each  query-document  pair.  The  compu¬ 
tational  time  is  thus  O(MU'),  where  U'  is  the  number  of  query  sessions  in  the  new  data.  Extra 
space  is  only  needed  only  when  there  are  previously  unseen  web  documents  of  interest. 

Eq.  A.  13  gives  the  analytical  formula  of  the  un-normalized  posterior  as  a  polynomial  function 
whose  order  depend  on  the  number  of  impressions  and  clicks  of  the  corresponding  document. 
To  evaluate  the  normalization  constant,  we  could  perform  integration  of  Pd(r)  over  r  G  [0, 1]  in  a 
straightforward  way  by  sequentially  multiplying  all  the  linear  factors  to  obtain  every  coefficient 
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Algorithm  4  :  Numerical  Integration  for  Posteriors 
Input:  Exponents  T  (N  x  (2 M  +  2)  matrix) 

Coefficients  (3  ((2 M  +  2) -dim  vector) 

Number  of  bins  B 

Output:  Posterior  moments  T  (N  x  K  matrix) 

Fdk  is  the  /.  th  posterior  moment  for  document  d 

Create  a  N  x  (K  +  1)  matrix  T  for  intermediate  results. 

Create  a  /i-dimcnsional  vector  r  to  keep  the  center  of  each  bin: 

rb  —  (b  —  0.5 )/B  for  1  <  b  <  B 
for  1  <  k  <  K  +  1 
for  1  <  d  <  N 

Fdk  =  log  (  Ya,=  \  exP  (  -  log(-B)  +  (Td 2  +  Td3  +  k  -  1)  •  log(r6) 
+  Ej=i+2  Tdj  log(l  +  Pin))}- 

for  1  <  k  <  K 

for  1  <  d  <  N 

_ Fdk  =  exp  (, Fd)k+l  -  F\i) _ 

Time  Complexity:  O(KMNB). 

Extra  Space:  0(KN  +  MB). 


of  the  polynomial  function.  However,  the  accuracy  suffers  significantly  from  rounding  errors  as 
the  size  of  the  input  and  the  order  of  the  polynomial  functions  goes  up,  with  a  more  expensive 
computational  cost  as  well.  To  finesse  this  numerical  problem,  we  propose  Algorithm  4  for 
computing  posterior  moments  using  the  midpoint  rule  with  B  equal-size  bins  to  approximate  the 
integral  and  posterior  moments.  B  =  100  is  usually  sufficient  in  most  cases,  though  the  number 
of  bins  could  be  adaptively  set  if  necessary.  Let 

1  B  2M+2 

fd(k)=  rkpd(r)dr  ^ —y2rld'2+Td’3+k  T[  (1+ ^rb)Td’\  (A.  14) 

*  6=i  j=i 

then  the  estimation  for  the  A:th  moment  for  document  d  is  fd(k) /./)/(0).  To  avoid  numerical 
problems  in  the  implementation,  we  usually  take  sum  of  log  values  for  the  linear  factors  instead 
of  multiplying  them  directly.  And  the  above  procedure  could  be  extended  to  compute  other 
posterior  estimates  depending  on  the  application  scenario. 


A.4.3  Parameter  Estimation 

We  perform  approximate  maximum-likelihood  estimation  to  leam  model  parameters  a.  The 
approximation  is  based  on  the  same  assumption  we  proposed  at  the  beginning  of  this  section  by 
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imposing  independent  priors  for  document  relevance  variables  across  query  sessions,  such  that 


r  u  u  r 

P{C)=  /  p(R)  Yl  /  p(R)p(C“|R)dR,  (A.15) 

where  R  =  {i?j|l  <  i  <  iV }  is  the  set  of  document  relevance  variables  for  the  query  of  interest. 
And  to  compute  the  log-likelihood  function,  we  simply  need  to  compute  the  last  integral  in 
the  previous  equation  for  each  query  session  and  sum  over  log  values  across  the  data  set.  The 
derivation  of  the  value  of  the  integral  is  analogous  to  that  in  Section  A.4.1,  and  we  could  write 
the  resulting  approximate  log-likelihood  function  as  follows: 

£(oc)  =  N4  logai  +  N2  log(a2  +  2 a3)  +  N3  log(6  -  3«i  -  (a2  +  2 a3)) 

+  N5  log(l  -  ai)  -  (N3  +  N5)  log(2  -  cti)  +  (constant)  (A.  16) 


where  Nt  is  the  total  number  of  times  documents  fall  into  case  i  in  Figure  A. 3  in  the  click  data. 
By  maximizing  this  approximate  log-likelihood  w.r.t.  a,  we  have 


3 N3  +  N2  +  N5-  y/{3N i  +  N2  +  N5)2  -  8Ai(Ai  +  N2) 

2(Ni  +  N2) 


(A.  17) 


and 


ol2  +  2a3 


3N2(2  —  ai) 
N2  +  N3 


(A. 18) 


Eq.  A.  18  leave  a  degree  of  freedom  in  the  choice  of  a2  and  0:3  values,  which  is  introduced  by  the 
iid  uniform  priors  over  all  documents  and  the  approximation  scheme  we  took.  We  can  assign  a 
value  to  a2/a3  according  to  the  context  of  the  model  application  (more  on  this  in  experiments). 
Note  that  parameter  values  do  not  depend  on  N4  when  the  chain  length  is  infinite. 


A.5  Performance  Evaluation 

In  this  section,  we  report  on  the  performance  evaluation  and  comparison  based  on  a  data  set 
with  8.8  million  query  sessions  that  are  uniformly  sampled  from  a  commercial  search  engine. 
We  measure  the  performance  of  three  click  models  with  a  number  of  evaluation  metrics.  Our 
results  show  that  CCM  consistently  outperforms  its  competitors  UBM  and  DCM  with:  (1)  over 
9.7%  better  log-likelihood  (Section  A. 5. 2);  (2)  over  6.2%  improvement  in  click  perplexity  (Sec¬ 
tion  A. 5. 3);  (3)  more  robust  (up  to  30%)  click  statistics  prediction  (Section  A. 5.4).  As  these 
widely  used  metrics  measure  model  performance  from  different  aspects,  the  uniformly  better 
results  demonstrate  that  CCM  robustly  captures  user  clicks  in  a  number  of  contexts.  Finally, 
in  Section  A. 5. 5,  we  present  the  empirical  examine  and  click  distribution  curves,  which  help 
illustrate  the  differences  in  modeling  assumptions. 
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A.5.1  Experimental  Setup 

The  experiment  data  set  is  identical  to  the  one  described  in  Section  6.3.1  For  each  query,  we 
compute  document  relevance  and  position  relevance  based  on  each  of  the  three  models,  respec¬ 
tively.  Position  relevance  is  computed  by  treating  each  position  as  a  pseudo-document.  The 
position  relevance  can  substitute  the  document  relevance  estimates  for  documents  that  appear 
zero  or  very  few  times  in  the  training  set  but  do  appear  in  the  test  set.  This  essentially  smoothes 
the  predictive  model  and  improves  the  performance  on  the  test  set.  The  cutoff  is  set  adaptively 
to  [2 log10(Query  Frequency)].  For  CCM,  the  number  of  bins  B  is  set  to  100  which  already 
provides  an  adequate  level  of  accuracy. 

To  account  for  heterogeneous  browsing  patterns  for  different  queries,  we  fit  different  mod¬ 
els  for  navigational  queries  and  informational  queries  and  learn  two  sets  of  parameters  for  each 
model  according  to  the  median  of  click  distribution  over  positions  [49,  71].  In  particular,  CCM 
sets  the  ratio  a2/as  equal  to  2.5  for  navigational  queries  and  1.5  for  informational  queries,  be¬ 
cause  a  larger  ratio  implies  a  smaller  examination  probability  after  a  click.  The  value  of  a\  equals 
1  as  a  result  of  discarding  query  sessions  with  no  clicks  (jV5  =  0  in  Eq.  A.  17). 

To  evaluate  model  performance  on  test  data,  we  compute  the  log-likelihood  and  other  statis¬ 
tics  needed  for  each  query  session  using  the  document  relevance  and  user  behavior  parameter 
estimates  learned/inferred  from  training  data.  In  particular,  by  assuming  independent  document 
relevance  priors  in  CCM,  all  the  necessary  statistics  can  be  derived  in  closed  form,  as  summa¬ 
rized  in  [50]  (Appendix  B).  Finally,  to  avoid  infinite  values  in  log-likelihood,  a  lower  bound  of 
0.01  and  a  upper  bound  of  0.99  are  applied  to  document  relevance  estimates  for  DCM  and  UBM. 

All  experiments  described  in  the  remainder  of  this  section  were  carried  out  on  a  64-bit  server 
with  32GB  RAM  and  eight  2.8GHz  cores  with  MATLAB  2008b  installed.  The  total  time  of 
model  training  for  UBM,  DCM  and  CCM  is  333  minutes,  5.4  minutes  and  9.8  minutes,  respec¬ 
tively.  For  CCM,  obtaining  sufficient  statistics  (Algorithm  3),  parameter  estimation,  and  poste¬ 
rior  computation  (Algorithm  4)  account  for  54%,  2.0%  and  44%  of  the  total  time,  respectively. 
For  UBM,  the  learning  algorithm  converges  in  34  iterations. 

A.5.2  Results  on  Log-Likelihood 

Log-likelihood  (LL)  is  widely  used  to  measure  model  fitness.  Given  the  document  impression  for 
each  query  session  in  the  test  data,  LL  is  defined  as  the  log  probability  of  observed  click  events 
computed  under  the  trained  model.  A  larger  LL  indicates  better  performance,  and  the  optimal 
value  is  0.  The  improvement  of  LL  value  tx  over  £2  is  computed  as  (exp^  —  £2)  —  1)  x  100%. 
We  report  average  LL  over  multiple  query  sessions  using  arithmetic  mean  values. 

Ligure  A.4  presents  LL  curves  for  the  three  models  over  different  frequency  categories.  The 
average  LL  over  all  query  sessions  is  -1.171  for  CCM,  which  means  that  with  31%  chance,  CCM 
predicts  user  clicks  exactly  over  top  10  positions  (out  of  the  210  possibilities)  for  a  query  session 
in  the  test  data  set.  The  average  LL  is  -1.264  for  UBM  and  -1.302  for  DCM,  with  correspond¬ 
ing  improvement  ratios  to  be  9.7%  and  14%  respectively.  Thanks  to  the  Bayesian  modeling  of 
document  relevance,  CCM  outperforms  the  other  two  models  more  significantly  on  less  frequent 
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Figure  A.4:  Log-likelihood  per  query  session  on  test  data  for  different  query  frequencies.  The 
overall  average  for  CCM  is  -1.171,  9.7%  better  than  UBM  (-1.264)  and  14%  better  than  DCM 
(-1.302). 


queries,  as  indicated  in  Figure  A.4.  Fitting  different  models  for  navigational  and  informational 
queries  leads  to  2.5%  better  LL  for  DCM  compared  with  a  previous  implementation  in  [51]  on 
the  same  data  set  (average  LL  =  -1.327),  which  is  consistent  with  our  results  [49]  obtained  from 
another  data  source. 

A.5.3  Results  on  Click  Perplexity 

Click  perplexity  was  used  as  the  evaluation  metric  previously  in  [37].  It  is  derived  from  the  en¬ 
tropy  measure  for  binary  click  events  at  each  position  in  a  query  session  independently,  whereas 
log-likelihood  computation  is  based  on  the  whole  click  vector.  For  a  set  of  query  sessions  indexed 
by  n  =  1, . . . ,  N,  if  we  use  <{'■  to  denote  the  probability  of  click  derived  from  the  click  model  on 
position  i  and  Cf  to  denote  the  corresponding  binary  click  event,  then  the  click  perplexity 

Pi  =  2-w  EL  i(c?  lo§2  <??+(! -qp)  iog2(i-9”))_  (A.  19) 

A  smaller  perplexity  value  indicates  higher  prediction  quality,  and  the  optimal  value  is  1 .  The 
improvement  of  perplexity  values  pi  over  p2  is  given  by  (p2  —  p\ ) / (p2  —  1)  x  100%. 

The  average  click  perplexity  over  all  query  sessions  and  positions  is  1.1479  for  CCM,  which 
gives  a  6.2%  improvement  per  position  over  UBM  (1 . 1577)  and  a  7.0%  improvement  over  DCM 
(1.1590).  Again,  CCM  outperforms  the  other  two  models  more  significantly  on  less  frequent 
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(a)  Frequency  (b)  Position 

Figure  A. 5:  Perplexity  results  on  test  data  for  (a)  for  different  query  frequencies  or  (b)  different 
positions.  Average  click  perplexity  over  all  positions  for  CCM  is  1.1479,  6.2%  improvement 
over  UBM  (1.1577)  and  7.0%  over  DCM  (1.1590). 

queries  than  on  more  frequent  queries.  As  shown  in  Figure  A. 5(a),  the  improvement  ratio  is  over 
26%  when  compared  with  DCM  and  UBM  on  the  least  frequent  query  category.  Figure  A. 5(b) 
illustrates  that  click  prediction  quality  of  CCM  is  the  best  of  the  three  models  for  every  position, 
whereas  DCM  and  UBM  are  almost  comparable  with  each  other.  Perplexity  values  for  CCM  on 
position  1  and  position  10  equal  the  perplexity  of  predicting  the  outcome  of  tossing  a  biased  coin 
of  with  known  p(Head)  =  0.099  and  p(Head)  =  0.0117  respectively. 

A.5.4  Last-Click  Prediction 

Root-mean-square  (RMS)  error  on  click  statistics  prediction  was  used  as  an  evaluation  metric 
previously  in  [5 1].  It  is  calculated  by  comparing  the  observation  statistics  such  as  the  first  clicked 
position  or  the  last  clicked  position  in  a  query  session  with  the  model  predicted  position.  Predict¬ 
ing  the  last  clicked  position  is  particularly  challenging  for  click  models  while  predicting  the  first 
clicked  position  is  a  relatively  easy  task.  We  evaluate  the  performance  of  last-click  prediction  on 
the  test  data  under  two  settings  and  present  results  in  Figure  A. 6. 

First,  given  the  impression  data,  we  compute  the  distribution  of  the  last  clicked  position  in 
closed  form,  and  compare  the  expected  value  with  the  observed  statistics  to  compute  the  RMS 
error.  The  optimal  RMS  value  under  this  setting  is  approximately  the  standard  deviation  of  last 
clicked  positions  over  all  query  sessions  for  a  given  query,  and  it  is  included  in  Figure  A. 6  as  the 
“optimal-exp”  curve.  We  expect  that  a  model  that  gives  consistently  good  fit  of  click  data  would 
have  a  small  margin  with  respect  to  the  optimal  error.  The  improvement  of  RMS  error  values  e\ 
over  e2  w.r.t.  the  optimal  value  e*  is  given  by  (e2  —  ei) / (e2  —  e*)  *  100%.  We  report  average  error 
by  taking  the  RMS  mean  over  all  query  sessions.  The  optimal  RMS  error  under  this  setting  for 
last  clicked  position  is  1.443,  whereas  the  error  of  CCM  is  1.548,  which  is  9.8%  improvement 
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Figure  A. 6:  Root-mean- square  (RMS)  errors  for  predicting  the  last  clicked  position.  Prediction 
is  based  on  the  SIMulation  for  solid  curves  and  Expectation  for  dashed  curves. 

for  DCM  (1.560)  but  only  slightly  better  than  UBM  (0.2%). 

Under  the  second  setting,  we  simulate  query  session  clicks  from  the  model,  collect  those 
samples  with  clicks,  compare  the  last  clicked  position  against  the  ground  truth  in  the  test  data 
and  compute  the  RMS  error,  in  the  same  way  as  [51].  The  number  of  samples  in  the  simulation 
is  set  to  10.  The  optimal  RMS  error  is  the  same  as  in  the  previously  discussed  setting,  but 
it  is  much  more  difficult  to  achieve  this  lower  bound  under  the  current  setting  because  errors 
from  simulations  reflect  both  biases  and  variances  of  the  prediction.  We  report  the  RMS  error 
margin  for  the  same  model  between  the  two  settings  as  follows;  a  more  robust  click  model  should 
have  a  smaller  margin.  The  error  margin  on  last-click  prediction,  the  margin  is  0.460  for  CCM, 
compared  with  0.566  for  DCM  (23%  larger)  and  0.599  for  UBM  (30%  larger). 

In  summary,  CCM  is  the  best  of  the  three  models  in  this  experiment,  to  predict  the  first  and 
the  last  clicked  position  effectively  and  robustly. 

A.5.5  Position-bias  of  Examination  and  Click 

Model  click  distribution  over  positions  are  the  averaged  click  probabilities  derived  from  click 
models  based  on  the  document  impression  in  the  test  data.  It  reflects  the  position-bias  implied 
by  the  click  model  and  can  be  compared  with  the  objective  ground  truth-the  empirical  click 
distribution  over  the  test  set.  Any  deviation  of  model  click  distribution  from  the  ground  truth 
would  suggest  the  existing  modeling  bias  in  clicks.  Note  that  on  the  other  side,  a  close  match 
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Figure  A.7:  Examination  and  click  probability  distributions  over  the  top  10  positions. 


does  not  necessarily  imply  excellent  click  prediction,  for  example,  prediction  of  clicks  {00, 11} 
as  {01, 10}  would  still  have  a  perfect  empirical  distribution.  Model  examination  distribution  over 
positions  can  be  computed  in  a  similar  way  to  the  click  distribution.  But  there  is  no  ground  truth 
to  be  contrasted  with.  Under  the  examination  hypothesis,  the  gap  between  examination  and  click 
curves  of  the  same  model  reflects  the  average  document  relevance. 

Figure  A.7  illustrates  the  model  examination  and  click  distribution  derived  from  CCM,  DCM 
and  UBM  as  well  as  the  empirical  click  distribution  on  the  test  data.  All  three  models  under¬ 
estimate  the  click  probabilities  in  the  last  few  positions.  CCM  has  a  larger  bias  due  to  the  ap¬ 
proximation  of  infinite-length  in  inference  and  estimation.  This  approximation  is  immaterial  as 
CCM  gives  the  best  results  in  the  above  experiments.  For  certain  applications  that  require  very 
accurate  modeling  of  the  last  few  positions,  the  finite  version  of  CCM  could  be  employed  instead. 

Examination  curves  of  CCM  and  DCM  decrease  with  similar  exponential  rates.  Both  models 
are  based  on  the  cascade  assumption,  under  which  examination  is  a  linear  traverse  through  the 
rank.  The  examination  probability  derived  by  UBM  is  much  larger,  and  this  suggests  that  the 
absolute  value  of  document  relevance  derived  from  UBM  is  not  directly  comparable  to  that  from 
DCM  and  CCM.  Relevance  judgments  from  click  models  is  still  a  relative  measure. 
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A.6  Conclusion 


We  have  discussed  the  click  chain  model  (CCM),  which  features  Bayesian  modeling  and  in¬ 
ference  of  user-perceived  relevance.  Using  an  approximate  closed-form  representation  of  the 
relevance  posterior,  CCM  is  both  scalable  and  incremental,  perfectly  meeting  the  challenges 
posed  by  real  world  practice.  We  carry  out  a  set  of  extensive  experiments  with  various  evaluation 
metrics,  and  the  results  clearly  evidence  the  advancement  of  the  state-of-the-art. 
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Appendix  B 

RTW  Knowledge  Base  Query  Interface 


The  “Read  the  Web”  research  project  at  Carnegie  Mellon  University  [95]  has  made  available  an 
ever-updating  web  knowledge  base  which  is  made  up  millions  of  structured  beliefs  like  (Face- 
book,  Headquarteredln,  Palo_Alto).  This  part  of  the  appendix  describes  the  implementation  of  a 
more  user-friendly  online  interface  to  ease  the  browsing  process  and  perform  simple  proximity 
query.  The  data  set  herein  consists  of  0.4M  promoted  beliefs,  a  cleaner  version  of  the  raw  data 
set  in  the  same  format  as  the  aforementioned  triplet,  by  courtsey  of  Bryan  Kisiel  at  Carnegie 
Mellon  University. 

Figure  B.l  offers  a  snapshot  of  the  online  interface,  available  at  http  :  /  /www .  db  .  cs  . 
emu  .  edu/wk/.  It  supports  two  categories  of  user  actions: 

•  Fact  Browsing :  list  all  facts  for  a  given  query  word  and  its  role,  where  each  fact  could  be 
represented  as  a  triplet  of  (entity,  relationship,  value).  Figure  B.2(a)  highlights  a  browsing 
example. 

•  Similarity  Query,  list  relevant  information  for  a  given  a  query  word.  A  graph  is  constructed 
in  a  way  that  each  fact  in  the  knowledge  base  induces  an  edge  pointing  from  “entity”  to 
“value”.  Proximity  query  results  are  based  on  a  fast  simulation-based  approximation  of 
personalized  PageRank  [8].  Figure  B.2(b)  highlights  a  querying  example. 

The  implementation  of  the  interface  is  graphically  illustrated  in  Figure  B.3.  It  could  be 
roughly  divided  into  three  parts: 

•  Data  Store:  The  raw  data  set  is  stored  as  a  single  table  in  the  MySQL  database.  A  python 
script  is  written  to  generate  derived  data  tables  to  provide  more  friendly  interfaces  to  other 
parts  of  the  system. 

•  Query  Processing  Backend:  Upon  initialization,  it  connects  to  the  data  server  and  build 
up  in-memory  data  structures.  It  employs  Thrift  [105]  to  set  up  the  interface  between  the 
C++  backend  and  the  PHP  web  server.  Caching  of  recent  results  was  also  implemented 
to  reduce  the  response  time  and  provide  more  consistent  results  despite  the  randomness  of 
the  approximation  algorithm. 

•  Web  Client  and  Server:  The  client-side  collects  user  inputs  and  sends  to  the  server-side 
using  simple  Ajax  to  smooth  the  user  interaction  process  and  results  refreshing.  The 
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server-side  PHP  communicates  directly  to  the  MySQL  server  to  provide  typeahead-like 
query  suggestion.  Suggestions  for  single-character  inputs  are  hard-coded  to  eliminate  the 
possible  caveats  in  case  of  prolonged  ajax  response  time. 
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^  RTW  Knowledge  Base  Query  Interface 


This  is  a  simple  interface  to  query  the  knowledge  base  from  Read  The  Web  (RTW)  Project. 
which  consists  millions  of  (entity,  relation,  value)  triplets,  also  known  as  facts,  such  as 
("Pittsburgh'1,  "citylocatedinstate”.  "Pennsylvania").  It  supports  the  following  two  functions: 

1.  Fact  Browsing:  list  all  facts  for  a  given  query  word  and  its  role: 

2.  Similarity  Query’:  based  on  a  fast  approximation  of  personalized  pagerank. 


Query’  word:  □  Show  query’  suggestion 


Choose  a  task  for  the  query  word: 

. (\. . ]  T 

1 

Fact  Browsing 
Similarity  Query 

Query  Results: 

Figure  B.l:  A  snapshot  of  the  online  query  interface. 


Query  word:  Pittsburgh  0  Show  query  suggestion  Query  word:  perth  D  Show  query  suggestion 

Candidate  Words:  Pittsburgh,  pittsburghsteelers.  pittsburgh_pirates,  pittsburgh_penguins,  pittsburgh_post_gazette  Candidate  Words,  perth,  Perthshire,  perth_ambov,  perth_mint,  perthes — disease 


Choose  a  task  for  the  query  word:  Fact  Browsing 
Choose  a  role:  iii  Entity  ©  Relation  ©Value 


Choose  a  task  for  the  query  word:  Similarity  Query  » 

Max  number  of  results:  10  V  Show  more  parameters 

Number  of  Sampling  Paths:  10000  Probability  of  Restart:  0.25 


Query  Results: 


Query  Results: 


Time  elapsed:  0.022  seconds. 


Number  of  results:  278. 


[Entity  [Relation 

[Value 

[pittsburgh  [acquired 

carrick 

[pittsburgh  [actorstarredinmovie 

|city 

[pittsburgh  [actorstarredinmovie 

[juiy 

[pittsburgh  [agentcoDaboraleswithagent 

[roethHsberger 

(a)  Fact  Browsing 


Time  elapsed:  0.398  seconds. 

|Rank  [" 


|  Score  P 


|  1  |westem_australia  |  0.0285  |perth< - (statehascapital}< - westem_austrafa 

1  2 

city  1 0.0264 

perth- >(hasofEceincity)->city 

1  3 

austraUa  [  0.0212 

perth<-(statehascapital)<-australia 

1  4 

Scotland  [  0.019  [perth<-(statehascapitaI)<-scotland 

|  5 

skywest |  0.0183  [perth- >(citvloc atedinc ountrv) ->skywest 

1  6 

swan  |  0.0167  |perth->(dtyk>catedinstate)->swan 

|  7 

tay  |  0.0152  |perth->(cityKesonriver)->tay 

(b)  Similarity  Query 


Figure  B.2:  Illustration  of  user  actions  supported  by  the  query  interface. 
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Figure  B.3:  A  graphical  illustration  of  the  interface. 
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